Congestion management in a distributed computer system multiplying current variable injection rate with a constant to set new variable injection rate at source node

ABSTRACT

A distributed computer system includes links and routing devices coupled between the links and routing frames between the links. Each of the routing devices includes a congestion control mechanism for detecting congestion at the routing device and responding to detected congestion by gradually reducing an injection rate of frames routed from the routing device.

This application claims the benefit of 60/135,664, filed May 24, 1999, and claims the benefit of 60/154,150, filed Sep. 15, 1999.

THE FIELD OF THE INVENTION

The present invention generally relates to communication in distributed computer systems and more particularly to congestion management in distributed computer systems.

BACKGROUND OF THE INVENTION

In conventional distributed computer systems, distributed processes, which are on different nodes in the distributed computer system, typically employ transport services to communicate. A source process on a first node communicates messages to a destination process on a second node via a transport service. A message is herein defined to be an application-defined unit of data exchange, which is a primitive unit of communication between cooperating sequential processes. Messages are typically packetized into frames for communication on underlying communication services/fabrics. A frame is herein defined to be one unit of data encapsulated by a physical network protocol header and/or trailer.

Messages communicated over the underlying communication services/fabrics can often experience congestion for various reasons, such as head of line blocking. There are conventional congestion control mechanisms. Congestion control mechanisms typically fall into three categories: congestion detection mechanisms; congestion reporting mechanisms; and congestion response mechanisms. Congestion reporting mechanisms report the occurrence of congestion provided from congestion detection mechanisms, possibly for short term use in alleviating congestion and possibly for long term network management. The congestion response mechanisms attempt to alleviate or remove congestion. Congestion in large distributed computer systems is a significant problem today, especially in infrastructures of remote computer systems having congestion resulting from message traffic over an internet or intranet coupling the remote computer systems.

For reasons stated above and for other reasons presented in greater detail in the Description of the Preferred Embodiments section of the present specification, there is a need for an improved congestion management architecture for distributed computer systems to alleviate congestion problems in the distributed computer systems resulting from communicating messages between remote processes over the underlying communication services/fabrics. Such an improved congestion management architecture should provide congestion detection mechanisms; congestion reporting mechanisms; and congestion response mechanisms which efficiently operate together to better address congestion problems encountered today in infrastructures of remote computer systems connected by an internet or an intranet.

SUMMARY OF THE INVENTION

The present invention provides a distributed computer system having links and routing devices. The routing devices are coupled between the links and route frames between the links. Each of the routing devices includes a congestion control mechanism for detecting congestion at the routing device and responding to detected congestion by gradually reducing an injection rate of frames routed from the routing device.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a distributed computer system for implementing the present invention.

FIG. 2 is a diagram of an example host processor node for the computer system of FIG. 1.

FIG. 3 is a diagram of a portion of a distributed computer system employing a reliable connection service to communicate between distributed processes.

FIG. 4 is a diagram of a portion of a distributed computer system employing a reliable datagram service to communicate between distributed processes.

FIG. 5 is a diagram of an example host processor node for operation in a distributed computer system implementing the present invention.

FIG. 6 is a diagram of a portion of a distributed computer system illustrating subnets in the distributed computer system.

FIG. 7 is a diagram of a switch for use in a distributed computer system implementing the present invention.

FIG. 8 is a diagram of a portion of a distributed computer system.

FIG. 9A is a diagram of a work queue element (WQE) for operation in the distributed computer system of FIG. 8.

FIG. 9B is a diagram of the packetization process of a message created by the WQE of FIG. 9A into frames and flits.

FIG. 10A is a diagram of a message being transmitted with a reliable transport service illustrating frame transactions.

FIG. 10B is a diagram of a reliable transport service illustrating flit transactions associated with the frame transactions of FIG. 10A.

FIG. 11 is a diagram of a layered architecture for implementing the present invention.

FIG. 12 is a diagram of a simple tree configuration having mixed bandwidth lengths and adaptable links.

FIG. 13 is a diagram of a simple tree with mixed bandwidth lengths and adapter and router links.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

An example embodiment of a distributed computer system is illustrated generally at 30 in FIG. 1. Distributed computer system 30 is provided merely for illustrative purposes, and the embodiments of the present invention described below can be implemented on computer systems of numerous other types and configurations. For example, computer systems implementing the present invention can range from a small server with one processor and a few input/output (I/O) adapters to massively parallel supercomputer systems with hundreds or thousands of processors and thousands of I/O adapters. Furthermore, the present invention can be implemented in an infrastructure of remote computer systems connected by an internet or intranet.

Distributed computer system 30 includes a system area network (SAN) 32 which is a high-bandwidth, low-latency network interconnecting nodes within distributed computer system 30. A node is herein defined to be any device attached to one or more links of a network and forming the origin and/or destination of messages within the network. In the example distributed computer system 30, nodes include host processors 34 a–34 d; redundant array of independent disks (RAID) subsystem 33; and I/O adapters 35 a and 35 b. The nodes illustrated in FIG. 1 are for illustrative purposes only, as SAN 32 can connect any number and any type of independent processor nodes, I/O adapter nodes, and I/O device nodes. Any one of the nodes can function as an endnode, which is herein defined to be a device that originates or finally consumes messages or frames in the distributed computer system.

A message is herein defined to be an application-defined unit of data exchange, which is a primitive unit of communication between cooperating sequential processes. A frame is herein defined to be one unit of data encapsulated by a physical network protocol header and/or trailer. The header generally provides control and routing information for directing the frame through SAN 32. The trailer generally contains control and cyclic redundancy check (CRC) data for ensuring packets are not delivered with corrupted contents.

SAN 32 is the communications and management infrastructure supporting both I/O and interprocess communication (IPC) within distributed computer system 30. SAN 32 includes a switched communications fabric (SAN FABRIC) allowing many devices to concurrently transfer data with high bandwidth and low latency in a secure, remotely managed environment. Endnodes can communicate over multiple ports and utilize multiple paths through the SAN fabric. The multiple ports and paths through SAN 32 can be employed for fault tolerance and increased bandwidth data transfers.

SAN 32 includes switches 36 and routers 38. A switch is herein defined to be a device that connects multiple links 40 together and allows routing of frames from one link 40 to another link 40 within a subnet using a small header destination ID field. A router is herein defined to be a device that connects multiple links 40 together and is capable of routing frames from one link 40 in a first subnet to another link 40 in a second subnet using a large header destination address or source address.

In one embodiment, a link 40 is a full duplex channel between any two network fabric elements, such as endnodes, switches 36, or routers 38. Example suitable links 40 include, but are not limited to, copper cables, optical cables, and printed circuit copper traces on backplanes and printed circuit boards.

Endnodes, such as host processor endnodes 34 and I/O adapter endnodes 35, generate request frames and return acknowledgment frames. By contrast, switches 36 and routers 38 do not generate or consume frames. Switches 36 and routers 38 simply pass frames along. In the case of switches 36, the frames are passed along unmodified. For routers 38, the network header is modified slightly when the frame is routed. Endnodes, switches 36, and routers 38 are collectively referred to as endstations.

In distributed computer system 30, host processor nodes 34 a–34 d and RAID subsystem node 33 include at least one system area network interface controller (SANIC) 42. In one embodiment, each SANIC 42 is an endpoint that implements the SAN 32 interface in sufficient detail to source or sink frames transmitted on the SAN fabric. The SANICs 42 provide an interface to the host processors and I/O devices. In one embodiment the SANIC is implemented in hardware. In this SANIC hardware implementation, the SANIC hardware offloads much of CPU and I/O adapter communication overhead. This hardware implementation of the SANIC also permits multiple concurrent communications over a switched network without the traditional overhead associated with communicating protocols. In one embodiment, SAN 32 provides the I/O and IPC clients of distributed computer system 30 zero processor-copy data transfers without involving the operating system kernel process, and employs hardware to provide reliable, fault tolerant communications.

As indicated in FIG. 1, router 38 is coupled to wide area network (WAN) and/or local area network (LAN) connections to other hosts or other routers 38.

The host processors 34 a–34 d include central processing units (CPUs) 44 and memory 46.

I/O adapters 35 a and 35 b include an I/O adapter backplane 48 and multiple I/O adapter cards 50. Example adapter cards 50 illustrated in FIG. 1 include an SCSI adapter card; an adapter card to fiber channel hub and FC-AL devices; an Ethernet adapter card; and a graphics adapter card. Any known type of adapter card can be implemented. I/O adapters 35 a and 35 b also include a switch 36 in the I/O adapter backplane 48 to couple the adapter cards 50 to the SAN 32 fabric.

RAID subsystem 33 includes a microprocessor 52, memory 54, read/write circuitry 56, and multiple redundant storage disks 58.

SAN 32 handles data communications for I/O and IPC in distributed computer system 30. SAN 32 supports the high bandwidth and scalability required for I/O and also supports the extremely low latency and low CPU overhead required for IPC. User clients can bypass the operating system kernel process and directly access network communication hardware, such as SANICs 42, which enable efficient message passing protocols. SAN 32 is suited to current computing models and is a building block for new forms of I/O and computer cluster communication. SAN 32 allows I/O adapter nodes to communicate among themselves or communicate with any or all of the processor nodes in distributed computer system 30. With an I/O adapter attached to SAN 32, the resulting I/O adapter node has substantially the same communication capability as any processor node in distributed computer system 30.

Channel and Memory Semantics

In one embodiment, SAN 32 supports channel semantics and memory semantics. Channel semantics is sometimes referred to as send/receive or push communication operations, and is the type of communications employed in a traditional I/O channel where a source device pushes data and a destination device determines the final destination of the data. In channel semantics, the frame transmitted from a source process specifies a destination process's communication port, but does not specify where in the destination process's memory space the frame will be written. Thus, in channel semantics, the destination process pre-allocates where to place the transmitted data.

In memory semantics, a source process directly reads or writes the virtual address space of a remote node destination process. The remote destination process need only communicate the location of a buffer for data, and does not need to be involved with the transfer of any data. Thus, in memory semantics, a source process sends a data frame containing the destination buffer memory address of the destination process. In memory semantics, the destination process previously grants permission for the source process to access its memory.

Channel semantics and memory semantics are typically both necessary for I/O and IPC. A typical I/O operation employs a combination of channel and memory semantics. In an illustrative example I/O operation of distributed computer system 30, host processor 34 a initiates an I/O operation by using channel semantics to send a disk write command to I/O adapter 35 b. I/O adapter 35 b examines the command and uses memory semantics to read the data buffer directly from the memory space of host processor 34 a. After the data buffer is read, I/O adapter 35 b employs channel semantics to push an I/O completion message back to host processor 34 a.
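The three-step disk write described above can be traced with a minimal Python sketch. Everything here is a hypothetical illustration rather than part of the described embodiment: the `Node` class, the `send` and `rdma_read` helpers, the buffer address 0x1000, and the in-memory `disk` list are assumptions chosen only to show which steps use channel semantics and which use memory semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical endnode: a name, some addressable memory, and an inbox."""
    name: str
    memory: dict = field(default_factory=dict)  # buffer address -> bytes
    inbox: list = field(default_factory=list)   # messages pushed to this node


def send(src: Node, dst: Node, message: dict) -> None:
    """Channel semantics: the source pushes; the destination decides placement."""
    dst.inbox.append({"from": src.name, **message})


def rdma_read(target: Node, address: int) -> bytes:
    """Memory semantics: the reader names the remote buffer address directly."""
    return target.memory[address]


host = Node("host_34a", memory={0x1000: b"block to write"})
adapter = Node("io_adapter_35b")
disk = []  # stand-in for the storage behind the adapter

# Step 1: the host uses channel semantics to send a disk write command.
send(host, adapter, {"op": "disk_write", "buffer": 0x1000})

# Step 2: the adapter examines the command and uses memory semantics to
# read the data buffer directly from the host's memory space.
command = adapter.inbox.pop(0)
disk.append(rdma_read(host, command["buffer"]))

# Step 3: the adapter uses channel semantics to push a completion message.
send(adapter, host, {"op": "io_complete"})
```

Note that the host never copies the data itself in step 2; it only names the buffer in the command, which is the point of combining the two semantics.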

In one embodiment, distributed computer system 30 performs operations that employ virtual addresses and virtual memory protection mechanisms to ensure correct and proper access to all memory. In one embodiment, applications running in distributed computer system 30 are not required to use physical addressing for any operations.

Queue Pairs

An example host processor node 34 is generally illustrated in FIG. 2. Host processor node 34 includes a process A indicated at 60 and a process B indicated at 62. Host processor node 34 includes SANIC 42. Host processor node 34 also includes queue pairs (QPs) 64 a and 64 b which provide communication between process 60 and SANIC 42. Host processor node 34 also includes QP 64 c which provides communication between process 62 and SANIC 42. A single SANIC, such as SANIC 42 in a host processor 34, can support thousands of QPs. By contrast, a SAN interface in an I/O adapter 35 typically supports fewer than ten QPs.

Each QP 64 includes a send work queue 66 and a receive work queue 68. A process, such as processes 60 and 62, calls an operating-system specific programming interface, herein referred to as verbs, which places work items, referred to as work queue elements (WQEs), onto a QP 64. A WQE is executed by hardware in SANIC 42. SANIC 42 is coupled to SAN 32 via physical link 40. Send work queue 66 contains WQEs that describe data to be transmitted on the SAN 32 fabric. Receive work queue 68 contains WQEs that describe where to place incoming data from the SAN 32 fabric.

Host processor node 34 also includes completion queue 70 a interfacing with process 60 and completion queue 70 b interfacing with process 62. The completion queues 70 contain information about completed WQEs. The completion queues are employed to create a single point of completion notification for multiple QPs. A completion queue entry is a data structure on a completion queue 70 that describes a completed WQE. The completion queue entry contains sufficient information to determine the QP that holds the completed WQE. A completion queue context is a block of information that contains pointers to, the length of, and other information needed to manage the individual completion queues.
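The relationship among QPs, work queues, WQEs, and a shared completion queue can be modeled with a minimal Python sketch. The class names, the `post_send` and `post_receive` methods, and the `execute_next_send` stand-in for SANIC hardware are hypothetical names introduced for illustration; the verbs interface itself is not specified here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Hypothetical work queue element."""
    kind: str            # e.g. "send", "rdma_read", "receive"
    buffer: bytes = b""


@dataclass
class CompletionQueue:
    """Single point of completion notification, shareable by several QPs."""
    entries: deque = field(default_factory=deque)


@dataclass
class QueuePair:
    qp_id: int
    cq: CompletionQueue
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)

    def post_send(self, wqe: WQE) -> None:
        self.send_queue.append(wqe)

    def post_receive(self, wqe: WQE) -> None:
        self.receive_queue.append(wqe)

    def execute_next_send(self) -> WQE:
        """Stand-in for SANIC hardware retiring the oldest send WQE."""
        wqe = self.send_queue.popleft()
        # The completion entry records which QP held the completed WQE.
        self.cq.entries.append((self.qp_id, wqe))
        return wqe


# Two QPs owned by one process report to the same completion queue.
cq = CompletionQueue()
qp_a, qp_b = QueuePair(1, cq), QueuePair(2, cq)
qp_a.post_send(WQE("send", b"hello"))
qp_b.post_send(WQE("rdma_read"))
qp_b.execute_next_send()
qp_a.execute_next_send()
completed = [(qp_id, wqe.kind) for qp_id, wqe in cq.entries]
```

Because each completion entry carries the QP identifier, the process can poll one queue and still tell which QP each completed WQE came from.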

Example WQEs include work items that initiate data communications employing channel semantics or memory semantics; work items that are instructions to hardware in SANIC 42 to set or alter remote memory access protections; and work items to delay the execution of subsequent WQEs posted in the same send work queue 66.

More specifically, example WQEs supported for send work queues 66 are as follows. A send buffer WQE is a channel semantic operation to push a local buffer to a remote QP's receive buffer. The send buffer WQE includes a gather list to combine several virtually contiguous local buffers into a single message that is pushed to a remote QP's receive buffer. The local buffer virtual addresses are in the address space of the process that created the local QP.

A remote direct memory access (RDMA) read WQE provides a memory semantic operation to read a virtually contiguous buffer on a remote node. The RDMA read WQE reads a virtually contiguous buffer on a remote endnode and writes the data to a virtually contiguous local memory buffer. Similar to the send buffer WQE, the local buffer for the RDMA read WQE is in the address space of the process that created the local QP. The remote buffer is in the virtual address space of the process owning the remote QP targeted by the RDMA read WQE.

An RDMA write WQE provides a memory semantic operation to write a virtually contiguous buffer on a remote node. The RDMA write WQE contains a scatter list of locally virtually contiguous buffers and the virtual address of the remote buffer into which the local buffers are written.

An RDMA FetchOp WQE provides a memory semantic operation to perform an atomic operation on a remote word. The RDMA FetchOp WQE is a combined RDMA read, modify, and RDMA write operation. The RDMA FetchOp WQE can support several read-modify-write operations, such as Compare and Swap if equal.

A bind/unbind remote access key (RKey) WQE provides a command to SANIC hardware to modify the association of an RKey with a local virtually contiguous buffer. The RKey is part of each RDMA access and is used to validate that the remote process has permitted access to the buffer.
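A minimal Python sketch of how an RKey might gate RDMA accesses follows. The `RKeyTable` class, the `Region` type, and the bounds-check rule are assumptions made for illustration; the text states only that the RKey is bound to a virtually contiguous buffer and validated on each RDMA access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical virtually contiguous buffer: start address and length."""
    start: int
    length: int


class RKeyTable:
    """Illustrative bind/unbind table; not the SANIC's actual mechanism."""

    def __init__(self) -> None:
        self._bindings = {}

    def bind(self, rkey: int, region: Region) -> None:
        self._bindings[rkey] = region

    def unbind(self, rkey: int) -> None:
        self._bindings.pop(rkey, None)

    def permits(self, rkey: int, address: int, length: int) -> bool:
        """Allow an access only if the RKey is bound and the range fits."""
        region = self._bindings.get(rkey)
        if region is None:
            return False
        end = region.start + region.length
        return region.start <= address and address + length <= end


table = RKeyTable()
table.bind(0xABCD, Region(start=0x1000, length=256))
in_bounds = table.permits(0xABCD, 0x1000, 256)    # whole buffer
out_of_bounds = table.permits(0xABCD, 0x10F0, 32)  # runs past the end
table.unbind(0xABCD)
after_unbind = table.permits(0xABCD, 0x1000, 16)
```

In this sketch, unbinding immediately revokes access, which mirrors the idea that the owning process controls what remote processes may touch.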

A delay WQE provides a command to SANIC hardware to delay processing of the QP's WQEs for a specific time interval. The delay WQE permits a process to meter the flow of operations into the SAN fabric.

In one embodiment, receive queues 68 only support one type of WQE, which is referred to as a receive buffer WQE. The receive buffer WQE provides a channel semantic operation describing a local buffer into which incoming send messages are written. The receive buffer WQE includes a scatter list describing several virtually contiguous local buffers. An incoming send message is written to these buffers. The buffer virtual addresses are in the address space of the process that created the local QP.

For IPC, a user-mode software process transfers data through QPs 64 directly from where the buffer resides in memory. In one embodiment, the transfer through the QPs bypasses the operating system and consumes few host instruction cycles. QPs 64 permit zero processor-copy data transfer with no operating system kernel involvement. The zero processor-copy data transfer provides for efficient support of high-bandwidth and low-latency communication.

Transport Services

When a QP 64 is created, the QP is set to provide a selected type of transport service. In one embodiment, a distributed computer system implementing the present invention supports four types of transport services.

A portion of a distributed computer system employing a reliable connection service to communicate between distributed processes is illustrated generally at 100 in FIG. 3. Distributed computer system 100 includes a host processor node 102, a host processor node 104, and a host processor node 106. Host processor node 102 includes a process A indicated at 108. Host processor node 104 includes a process B indicated at 110 and a process C indicated at 112. Host processor node 106 includes a process D indicated at 114.

Host processor node 102 includes a QP 116 having a send work queue 116 a and a receive work queue 116 b; a QP 118 having a send work queue 118 a and receive work queue 118 b; and a QP 120 having a send work queue 120 a and a receive work queue 120 b which facilitate communication to and from process A indicated at 108. Host processor node 104 includes a QP 122 having a send work queue 122 a and receive work queue 122 b for facilitating communication to and from process B indicated at 110. Host processor node 104 includes a QP 124 having a send work queue 124 a and receive work queue 124 b for facilitating communication to and from process C indicated at 112. Host processor node 106 includes a QP 126 having a send work queue 126 a and receive work queue 126 b for facilitating communication to and from process D indicated at 114.

The reliable connection service of distributed computer system 100 associates a local QP with one and only one remote QP. Thus, QP 116 is connected to QP 122 via a non-sharable resource connection 128 having a non-sharable resource connection 128 a from send work queue 116 a to receive work queue 122 b and a non-sharable resource connection 128 b from send work queue 122 a to receive work queue 116 b. QP 118 is connected to QP 124 via a non-sharable resource connection 130 having a non-sharable resource connection 130 a from send work queue 118 a to receive work queue 124 b and a non-sharable resource connection 130 b from send work queue 124 a to receive work queue 118 b. QP 120 is connected to QP 126 via a non-sharable resource connection 132 having a non-sharable resource connection 132 a from send work queue 120 a to receive work queue 126 b and a non-sharable resource connection 132 b from send work queue 126 a to receive work queue 120 b.

A send buffer WQE placed on one QP in a reliable connection service causes data to be written into the receive buffer of the connected QP. RDMA operations operate on the address space of the connected QP.

The reliable connection service requires a process to create a QP for each process with which it is to communicate over the SAN fabric. Thus, if each of N host processor nodes contains M processes, and all M processes on each node wish to communicate with all the processes on all the other nodes, each host processor node requires M²×(N−1) QPs. Moreover, a process can connect a QP to another QP on the same SANIC.

In one embodiment, the reliable connection service is made reliable because hardware maintains sequence numbers and acknowledges all frame transfers. A combination of hardware and SAN driver software retries any failed communications. The process client of the QP obtains reliable communications even in the presence of bit errors, receive buffer underruns, and network congestion. If alternative paths exist in the SAN fabric, reliable communications can be maintained even in the presence of failures of fabric switches or links.

In one embodiment, acknowledgments are employed to deliver data reliably across the SAN fabric. In one embodiment, the acknowledgment is not a process level acknowledgment, because the acknowledgment does not validate that the receiving process has consumed the data. Rather, the acknowledgment only indicates that the data has reached its destination.

A portion of a distributed computer system employing a reliable datagram service to communicate between distributed processes is illustrated generally at 150 in FIG. 4. Distributed computer system 150 includes a host processor node 152, a host processor node 154, and a host processor node 156. Host processor node 152 includes a process A indicated at 158. Host processor node 154 includes a process B indicated at 160 and a process C indicated at 162. Host processor node 156 includes a process D indicated at 164.

Host processor node 152 includes QP 166 having send work queue 166 a and receive work queue 166 b for facilitating communication to and from process A indicated at 158. Host processor node 154 includes QP 168 having send work queue 168 a and receive work queue 168 b for facilitating communication from and to process B indicated at 160. Host processor node 154 includes QP 170 having send work queue 170 a and receive work queue 170 b for facilitating communication from and to process C indicated at 162. Host processor node 156 includes QP 172 having send work queue 172 a and receive work queue 172 b for facilitating communication from and to process D indicated at 164. In the reliable datagram service implemented in distributed computer system 150, the QPs are coupled in what is referred to as a connectionless transport service.

For example, a reliable datagram service 174 couples QP 166 to QPs 168, 170, and 172. Specifically, reliable datagram service 174 couples send work queue 166 a to receive work queues 168 b, 170 b, and 172 b. Reliable datagram service 174 also couples send work queues 168 a, 170 a, and 172 a to receive work queue 166 b.

The reliable datagram service permits a client process of one QP to communicate with any other QP on any other remote node. At a receive work queue, the reliable datagram service permits incoming messages from any send work queue on any other remote node.

In one embodiment, the reliable datagram service employs sequence numbers and acknowledgments associated with each message frame to ensure the same degree of reliability as the reliable connection service. End-to-end (EE) contexts maintain end-to-end specific state to keep track of sequence numbers, acknowledgments, and time-out values. The end-to-end state held in the EE contexts is shared by all the connectionless QPs communicating between a pair of endnodes. Each endnode requires at least one EE context for every endnode it wishes to communicate with in the reliable datagram service (e.g., a given endnode requires at least N EE contexts to be able to have reliable datagram service with N other endnodes).

The reliable datagram service greatly improves scalability because the reliable datagram service is connectionless. Therefore, an endnode with a fixed number of QPs can communicate with far more processes and endnodes with a reliable datagram service than with a reliable connection transport service. For example, if each of N host processor nodes contains M processes, and all M processes on each node wish to communicate with all the processes on all the other nodes, the reliable connection service requires M²×(N−1) QPs on each node. By comparison, the connectionless reliable datagram service only requires M QPs + (N−1) EE contexts on each node for exactly the same communications.
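The per-node resource counts quoted above, M²×(N−1) QPs for the reliable connection service versus M QPs plus (N−1) EE contexts for the reliable datagram service, can be checked with a short Python sketch. The function names and the sample values N=4, M=3 are chosen here only for illustration.

```python
def reliable_connection_qps(n_nodes: int, m_procs: int) -> int:
    """QPs per node: each of M local processes needs one QP for each of
    the M*(N-1) remote processes, giving M^2 * (N-1)."""
    return m_procs ** 2 * (n_nodes - 1)


def reliable_datagram_resources(n_nodes: int, m_procs: int) -> tuple:
    """(QPs, EE contexts) per node: one QP per local process and one
    EE context per remote node."""
    return m_procs, n_nodes - 1


# Example: 4 nodes with 3 processes each.
rc_qps = reliable_connection_qps(4, 3)
rd_qps, rd_ee_contexts = reliable_datagram_resources(4, 3)
```

The gap widens quickly: the connection count grows with the square of M, while the datagram resources grow only linearly in M and N.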

A third type of transport service for providing communications is an unreliable datagram service. Similar to the reliable datagram service, the unreliable datagram service is connectionless. The unreliable datagram service is employed by management applications to discover and integrate new switches, routers, and endnodes into a given distributed computer system. The unreliable datagram service does not provide the reliability guarantees of the reliable connection service and the reliable datagram service. The unreliable datagram service accordingly operates with less state information maintained at each endnode.

A fourth type of transport service is referred to as raw datagram service and is technically not a transport service. The raw datagram service permits a QP to send and to receive raw datagram frames. The raw datagram mode of operation of a QP is entirely controlled by software. The raw datagram mode of the QP is primarily intended to allow easy interfacing with traditional internet protocol, version 6 (IPv6) LAN-WAN networks, and further allows the SANIC to be used with full software protocol stacks to access transmission control protocol (TCP), user datagram protocol (UDP), and other standard communication protocols. Essentially, in the raw datagram service, SANIC hardware generates and consumes standard protocols layered on top of IPv6, such as TCP and UDP. The frame header can be mapped directly to and from an IPv6 header. Native IPv6 frames can be bridged into the SAN fabric and delivered directly to a QP to allow a client process to support any transport protocol running on top of IPv6. A client process can register with SANIC hardware in order to direct datagrams for a particular upper level protocol (e.g., TCP and UDP) to a particular QP. SANIC hardware can demultiplex incoming IPv6 streams of datagrams based on a next header field as well as the destination IP address.
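The registration and demultiplexing step can be illustrated with a minimal Python sketch keyed on the destination IP address and the next header value. The `Demultiplexer` class and its method names are hypothetical; the values 6 and 17 are the standard protocol numbers for TCP and UDP carried in the IPv6 Next Header field, and the addresses use the IPv6 documentation prefix.

```python
import ipaddress

# Standard protocol numbers carried in the IPv6 Next Header field.
TCP, UDP = 6, 17


class Demultiplexer:
    """Illustrative table mapping (destination IP, next header) to a QP."""

    def __init__(self) -> None:
        self._routes = {}

    def register(self, dest_ip: str, next_header: int, qp_id: int) -> None:
        """A client process directs one upper level protocol to one QP."""
        key = (ipaddress.IPv6Address(dest_ip), next_header)
        self._routes[key] = qp_id

    def lookup(self, dest_ip: str, next_header: int):
        """Return the registered QP, or None if nothing is registered."""
        key = (ipaddress.IPv6Address(dest_ip), next_header)
        return self._routes.get(key)


demux = Demultiplexer()
demux.register("2001:db8::1", TCP, qp_id=7)
demux.register("2001:db8::1", UDP, qp_id=9)
tcp_qp = demux.lookup("2001:db8:0:0:0:0:0:1", TCP)  # same address, long form
udp_qp = demux.lookup("2001:db8::1", UDP)
unknown = demux.lookup("2001:db8::2", TCP)
```

Parsing addresses with `ipaddress.IPv6Address` makes the lookup insensitive to how the same address is written.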

SANIC and I/O Adapter Endnodes

An example host processor node is generally illustrated at 200 in FIG. 5. Host processor node 200 includes a process A indicated at 202, a process B indicated at 204, and a process C indicated at 206. Host processor 200 includes a SANIC 208 and a SANIC 210. As discussed above, a host processor endnode or an I/O adapter endnode can have one or more SANICs. SANIC 208 includes a SAN link level engine (LLE) 216 for communicating with SAN fabric 224 via link 217 and an LLE 218 for communicating with SAN fabric 224 via link 219. SANIC 210 includes an LLE 220 for communicating with SAN fabric 224 via link 221 and an LLE 222 for communicating with SAN fabric 224 via link 223. SANIC 208 communicates with process A indicated at 202 via QPs 212 a and 212 b. SANIC 208 communicates with process B indicated at 204 via QPs 212 c–212 n. Thus, SANIC 208 includes N QPs for communicating with processes A and B. SANIC 210 includes QPs 214 a and 214 b for communicating with process B indicated at 204. SANIC 210 includes QPs 214 c–214 n for communicating with process C indicated at 206. Thus, SANIC 210 includes N QPs for communicating with processes B and C.

An LLE runs link level protocols to couple a given SANIC to the SAN fabric. RDMA traffic generated by a SANIC can simultaneously employ multiple LLEs within the SANIC, which permits striping across LLEs. Striping refers to the dynamic sending of frames within a single message to an endnode's QP through multiple fabric paths. Striping across LLEs increases the bandwidth for a single QP as well as provides multiple fault tolerant paths. Striping also decreases the latency for message transfers. In one embodiment, multiple LLEs in a SANIC are not visible to the client process generating message requests. When a host processor includes multiple SANICs, the client process must explicitly move data on the two SANICs in order to gain parallelism. A single QP cannot be shared by SANICs. Instead a QP is owned by one local SANIC.
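Striping the frames of one message across the LLEs of a SANIC can be sketched in a few lines of Python. Round-robin assignment is an assumption made here for simplicity; the text says only that frames are sent dynamically through multiple fabric paths, and the `stripe` function and LLE labels are hypothetical.

```python
from itertools import cycle


def stripe(frames: list, lle_names: list) -> dict:
    """Assign the frames of one message to LLEs in round-robin order.

    Round-robin is an illustrative policy, not the described mechanism.
    """
    assignment = {name: [] for name in lle_names}
    for frame, name in zip(frames, cycle(lle_names)):
        assignment[name].append(frame)
    return assignment


# Five frames of a single message spread over the two LLEs of one SANIC.
plan = stripe(["f0", "f1", "f2", "f3", "f4"], ["LLE_216", "LLE_218"])
```

With two LLEs, each link carries roughly half of the frames, which is where the bandwidth gain for a single QP comes from.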

The following is an example naming scheme for naming and identifying endnodes in one embodiment of a distributed computer system according to the present invention. A host name provides a logical identification for a host node, such as a host processor node or I/O adapter node. The host name identifies the endpoint for messages such that messages are destined for processes residing on an endnode specified by the host name. Thus, there is one host name per node, but a node can have multiple SANICs.

A globally unique ID (GUID) identifies a transport endpoint. A transport endpoint is the device supporting the transport QPs. There is one GUID associated with each SANIC.

A local ID refers to a short address ID used to identify a SANIC within a single subnet. In one example embodiment, a subnet has up to 2¹⁶ endnodes, switches, and routers, and the local ID (LID) is accordingly 16 bits. A source LID (SLID) and a destination LID (DLID) are the source and destination LIDs used in a local network header. An LLE has a single LID associated with the LLE, and the LID is only unique within a given subnet. One or more LIDs can be associated with each SANIC.

An internet protocol (IP) address (e.g., a 128 bit IPv6 ID) addresses a SANIC. The SANIC, however, can have one or more IP addresses associated with the SANIC. The IP address is used in the global network header when routing frames outside of a given subnet. LIDs and IP addresses are network endpoints and are the target of frames routed through the SAN fabric. All IP addresses (e.g., IPv6 addresses) within a subnet share a common set of high order address bits.

In one embodiment, the LLE is not named and is not architecturally visible to a client process. In this embodiment, management software refers to LLEs as an enumerated subset of the SANIC.

Switches and Routers

A portion of a distributed computer system is generally illustrated at 250 in FIG. 6. Distributed computer system 250 includes a subnet A indicated at 252 and a subnet B indicated at 254. Subnet A indicated at 252 includes a host processor node 256 and a host processor node 258. Subnet B indicated at 254 includes a host processor node 260 and a host processor node 262. Subnet A indicated at 252 includes switches 264 a–264 c. Subnet B indicated at 254 includes switches 266 a–266 c. Each subnet within distributed computer system 250 is connected to other subnets with routers. For example, subnet A indicated at 252 includes routers 268 a and 268 b which are coupled to routers 270 a and 270 b of subnet B indicated at 254. In one example embodiment, a subnet has up to 2¹⁶ endnodes, switches, and routers.

A subnet is defined as a group of endnodes and cascaded switches that is managed as a single unit. Typically, a subnet occupies a single geographic or functional area. For example, a single computer system in one room could be defined as a subnet. In one embodiment, the switches in a subnet can perform very fast worm-hole or cut-through routing for messages.

A switch within a subnet examines the DLID that is unique within the subnet to permit the switch to quickly and efficiently route incoming message frames. In one embodiment, the switch is a relatively simple circuit, and is typically implemented as a single integrated circuit. A subnet can have hundreds to thousands of endnodes formed by cascaded switches.

As illustrated in FIG. 6, for expansion to much larger systems, subnets are connected with routers, such as routers 268 and 270. The router interprets the IP destination ID (e.g., IPv6 destination ID) and routes the IP-like frame.

In one embodiment, switches and routers degrade when links are over utilized. In this embodiment, link level back pressure is used to temporarily slow the flow of data when multiple input frames compete for a common output. However, link or buffer contention does not cause loss of data. In one embodiment, switches, routers, and endnodes employ a link protocol to transfer data. In one embodiment, the link protocol supports an automatic error retry. In this example embodiment, link level acknowledgments detect errors and force retransmission of any data impacted by bit errors. Link-level error recovery greatly reduces the number of data errors that are handled by the end-to-end protocols. In one embodiment, the user client process is not involved with error recovery, regardless of whether the error is detected and corrected by the link level protocol or the end-to-end protocol.

An example embodiment of a switch is generally illustrated at 280 in FIG. 7. Each I/O path on a switch or router has an LLE. For example, switch 280 includes LLEs 282 a–282 h for communicating respectively with links 284 a–284 h.

The naming scheme for switches and routers is similar to the above-described naming scheme for endnodes. The following is an example switch and router naming scheme for identifying switches and routers in the SAN fabric. A switch name identifies each switch or group of switches packaged and managed together. Thus, there is a single switch name for each switch or group of switches packaged and managed together.

Each switch or router element has a single unique GUID. Each switch has one or more LIDs and IP addresses (e.g., IPv6 addresses) that are used as an endnode for management frames.

Each LLE is not given an explicit external name in the switch or router. Since links are point-to-point, the other end of the link does not need to address the LLE.

Virtual Lanes

Switches and routers employ multiple virtual lanes within a single physical link. As illustrated in FIG. 6, physical links 272 connect endnodes, switches, and routers within a subnet. WAN or LAN connections 274 typically couple routers between subnets. Frames injected into the SAN fabric follow a particular virtual lane from the frame's source to the frame's destination. At any one time, only one virtual lane makes progress on a given physical link. Virtual lanes provide a technique for applying link level flow control to one virtual lane without affecting the other virtual lanes. When a frame on one virtual lane blocks due to contention, quality of service (QoS), or other considerations, a frame on a different virtual lane is allowed to make progress.

Virtual lanes are employed for numerous reasons, some of which are as follows. Virtual lanes provide QoS. In one example embodiment, certain virtual lanes are reserved for high priority or isochronous traffic to provide QoS.

Virtual lanes provide deadlock avoidance. Virtual lanes allow topologies that contain loops to send frames across all physical links and still be assured the loops won't cause back pressure dependencies that might result in deadlock.

Virtual lanes alleviate head-of-line blocking. With virtual lanes, a blocked frame can pass a temporarily stalled frame that is destined for a different final destination.

In one embodiment, each switch includes its own crossbar switch. In this embodiment, a switch propagates data from only one frame at a time, per virtual lane, through its crossbar switch. In other words, on any one virtual lane, a switch propagates a single frame from start to finish. Thus, in this embodiment, frames are not multiplexed together on a single virtual lane.

Paths in SAN fabric

Referring to FIG. 6, within a subnet, such as subnet A indicated at 252 or subnet B indicated at 254, a path from a source port to a destination port is determined by the LID of the destination SANIC port. Between subnets, a path is determined by the IP address (e.g., IPv6 address) of the destination SANIC port.

In one embodiment, the paths used by the request frame and the request frame's corresponding positive acknowledgment (ACK) or negative acknowledgment (NAK) frame are not required to be symmetric. In one embodiment employing oblivious routing, switches select an output port based on the DLID. In one embodiment, a switch uses one set of routing decision criteria for all its input ports. In one example embodiment, the routing decision criteria are contained in one routing table. In an alternative embodiment, a switch employs a separate set of criteria for each input port.
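The single-routing-table embodiment can be sketched as a lookup keyed only on the DLID. This is a minimal illustration, not the patented implementation: the class name `ObliviousSwitch`, its method names, and the port numbering are all hypothetical.

```python
from typing import Dict, Optional


class ObliviousSwitch:
    """Hypothetical switch that selects an output port from the DLID alone."""

    def __init__(self, routing_table: Dict[int, int], num_ports: int) -> None:
        for dlid, port in routing_table.items():
            # A LID is 16 bits in the example embodiment.
            if not 0 <= dlid < 2 ** 16:
                raise ValueError(f"DLID {dlid} does not fit in 16 bits")
            if not 0 <= port < num_ports:
                raise ValueError(f"port {port} is out of range")
        self.routing_table = dict(routing_table)

    def select_output_port(self, dlid: int, input_port: int) -> Optional[int]:
        # input_port is deliberately ignored: one set of routing decision
        # criteria serves every input port in this embodiment.
        return self.routing_table.get(dlid)
```

The alternative embodiment with separate criteria per input port would key the table on the pair (input port, DLID) instead.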

Each port on an endnode can have multiple IP addresses. Multiple IP addresses can be used for several reasons, some of which are provided by the following examples. In one embodiment, different IP addresses identify different partitions or services on an endnode. In one embodiment, different IP addresses are used to specify different QoS attributes. In one embodiment, different IP addresses identify different paths through intra-subnet routes.

In one embodiment, each port on an endnode can have multiple LIDs. Multiple LIDs can be used for several reasons, some of which are provided by the following examples. In one embodiment, different LIDs identify different partitions or services on an endnode. In one embodiment, different LIDs are used to specify different QoS attributes. In one embodiment, different LIDs specify different paths through the subnet.

A one-to-one correspondence does not necessarily exist between LIDs and IP addresses, because a SANIC can have more or fewer LIDs than IP addresses for each port. For SANICs with redundant ports and redundant connectivity to multiple SAN fabrics, SANICs can, but are not required to, use the same LID and IP address on each of their ports.

Data Transactions

Referring to FIG. 1, a data transaction in distributed computer system 30 is typically composed of several hardware and software steps. A client process of a data transport service can be a user-mode or a kernel-mode process. The client process accesses SANIC 42 hardware through one or more QPs, such as QPs 64 illustrated in FIG. 2. The client process calls an operating-system specific programming interface which is herein referred to as verbs. The software code implementing the verbs in turn posts a WQE to the given QP work queue.

There are many possible methods of posting a WQE and there are many possible WQE formats, which allow for various cost/performance design points, but which do not affect interoperability. A user process, however, must communicate to verbs in a well-defined manner, and the format and protocols of data transmitted across the SAN fabric must be sufficiently specified to allow devices to interoperate in a heterogeneous vendor environment.

In one embodiment, SANIC hardware detects WQE posting and accesses the WQE. In this embodiment, the SANIC hardware translates and validates the WQE's virtual addresses and accesses the data. In one embodiment, an outgoing message buffer is split into one or more frames. In one embodiment, the SANIC hardware adds a transport header and a network header to each frame. The transport header includes sequence numbers and other transport information. The network header includes the destination IP address or the DLID or other suitable destination address information. The appropriate local or global network header is added to a given frame depending on whether the destination endnode resides on the local subnet or on a remote subnet.

A frame is a unit of information that is routed through the SAN fabric. The frame is an endnode-to-endnode construct, and is thus created and consumed by endnodes. Switches and routers neither generate nor consume request frames or acknowledgment frames. Instead, switches and routers simply move request frames or acknowledgment frames closer to the ultimate destination. Routers, however, modify the frame's network header when the frame crosses a subnet boundary. In traversing a subnet, a single frame stays on a single virtual lane.

When a frame is placed onto a link, the frame is further broken down into flits. A flit is herein defined to be a unit of link-level flow control and is a unit of transfer employed only on a point-to-point link. The flow of flits is subject to the link-level protocol, which can perform flow control or retransmission after an error. Thus, a flit is a link-level construct that is created at each endnode, switch, or router output port and consumed at each input port. In one embodiment, a flit contains a header with virtual lane error checking information, size information, and reverse channel credit information.

If a reliable transport service is employed, after a request frame reaches its destination endnode, the destination endnode sends an acknowledgment frame back to the sender endnode. The acknowledgment frame permits the requestor to validate that the request frame reached the destination endnode. An acknowledgment frame is sent back to the requestor after each request frame. The requestor can have multiple outstanding requests before it receives any acknowledgments. In one embodiment, the number of multiple outstanding requests is determined when a QP is created.

Example Request and Acknowledgment Transactions

FIGS. 8, 9A, 9B, 10A, and 10B together illustrate example request and acknowledgment transactions. In FIG. 8, a portion of a distributed computer system is generally illustrated at 300. Distributed computer system 300 includes a host processor node 302 and a host processor node 304. Host processor node 302 includes a SANIC 306. Host processor node 304 includes a SANIC 308. Distributed computer system 300 includes a SAN fabric 309 which includes a switch 310 and a switch 312. SAN fabric 309 includes a link 314 coupling SANIC 306 to switch 310; a link 316 coupling switch 310 to switch 312; and a link 318 coupling SANIC 308 to switch 312.

In the example transactions, host processor node 302 includes a client process A indicated at 320. Host processor node 304 includes a client process B indicated at 322. Client process 320 interacts with SANIC hardware 306 through QP 324. Client process 322 interacts with SANIC hardware 308 through QP 326. QPs 324 and 326 are software data structures. QP 324 includes send work queue 324 a and receive work queue 324 b. QP 326 includes send work queue 326 a and receive work queue 326 b.

Process 320 initiates a message request by posting WQEs to send queue 324 a. Such a WQE is illustrated at 330 in FIG. 9A. The message request of client process 320 is referenced by a gather list 332 contained in send WQE 330. Each entry in gather list 332 points to a virtually contiguous buffer in the local memory space containing a part of the message, such as indicated by virtual contiguous buffers 334 a–334 d, which respectively hold message 0, parts 0, 1, 2, and 3.

Referring to FIG. 9B, hardware in SANIC 306 reads WQE 330 and packetizes the message stored in virtual contiguous buffers 334 a–334 d into frames and flits. As illustrated in FIG. 9B, all of message 0, part 0 and a portion of message 0, part 1 are packetized into frame 0, indicated at 336 a. The rest of message 0, part 1 and all of message 0, part 2, and all of message 0, part 3 are packetized into frame 1, indicated at 336 b. Frame 0 indicated at 336 a includes network header 338 a and transport header 340 a. Frame 1 indicated at 336 b includes network header 338 b and transport header 340 b.

As indicated in FIG. 9B, frame 0 indicated at 336 a is partitioned into flits 0–3, indicated respectively at 342 a–342 d. Frame 1 indicated at 336 b is partitioned into flits 4–7, indicated respectively at 342 e–342 h. Flits 342 a through 342 h respectively include flit headers 344 a–344 h.
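The packetization of a gather list into frames and then flits can be sketched as two successive fixed-size splits. This is only an illustration under stated assumptions: the function names `split` and `packetize`, the byte sizes, and the omission of network, transport, and flit headers are all simplifications not taken from the specification.

```python
from typing import List


def split(data: bytes, size: int) -> List[bytes]:
    """Cut data into consecutive chunks of at most `size` bytes."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def packetize(gather_list: List[bytes], frame_payload: int,
              flit_payload: int) -> List[List[bytes]]:
    """Return a list of frames, each frame being a list of flit payloads.

    Headers are omitted; only the payload boundaries are modeled.
    """
    # The gather list entries are logically concatenated into one message,
    # so a frame boundary may fall in the middle of a message part.
    message = b"".join(gather_list)
    frames = split(message, frame_payload)
    return [split(frame, flit_payload) for frame in frames]
```

With four message parts of 6, 4, 3, and 3 bytes, a frame payload of 8 bytes, and a flit payload of 2 bytes, the sketch yields two frames of four flits each, with frame 0 holding all of part 0 and a portion of part 1, mirroring the layout described for FIG. 9B.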

Frames are routed through the SAN fabric, and for reliable transfer services, are acknowledged by the final destination endnode. If not successfully acknowledged, the frame is retransmitted by the source endnode. Frames are generated by source endnodes and consumed by destination endnodes. The switches and routers in the SAN fabric neither generate nor consume frames.

Flits are the smallest unit of flow control in the network. Flits are generated and consumed at each end of a physical link. Flits are acknowledged at the receiving end of each link and are retransmitted in response to an error.

Referring to FIG. 10A, the send request message 0 is transmitted from SANIC 306 in host processor node 302 to SANIC 308 in host processor node 304 as frame 0 indicated at 336 a and frame 1 indicated at 336 b. ACK frames 346 a and 346 b, corresponding respectively to request frames 336 a and 336 b, are transmitted from SANIC 308 in host processor node 304 to SANIC 306 in host processor node 302.

In FIG. 10A, message 0 is being transmitted with a reliable transport service. Each request frame is individually acknowledged by the destination endnode (e.g., SANIC 308 in host processor node 304).

FIG. 10B illustrates the flits associated with the request frames 336 and acknowledgment frames 346 illustrated in FIG. 10A passing between the host processor endnodes 302 and 304 and the switches 310 and 312. As illustrated in FIG. 10B, an ACK frame fits inside one flit. In one embodiment, one acknowledgment flit acknowledges several flits.

As illustrated in FIG. 10B, flits 342 a–h are transmitted from SANIC 306 to switch 310. Switch 310 consumes flits 342 a–h at its input port, creates flits 348 a–h at its output port corresponding to flits 342 a–h, and transmits flits 348 a–h to switch 312. Switch 312 consumes flits 348 a–h at its input port, creates flits 350 a–h at its output port corresponding to flits 348 a–h, and transmits flits 350 a–h to SANIC 308. SANIC 308 consumes flits 350 a–h at its input port. An acknowledgment flit is transmitted from switch 310 to SANIC 306 to acknowledge the receipt of flits 342 a–h. An acknowledgment flit 354 is transmitted from switch 312 to switch 310 to acknowledge the receipt of flits 348 a–h. An acknowledgment flit 356 is transmitted from SANIC 308 to switch 312 to acknowledge the receipt of flits 350 a–h.

Acknowledgment frame 346 a fits inside of flit 358 which is transmitted from SANIC 308 to switch 312. Switch 312 consumes flit 358 at its input port, creates flit 360 corresponding to flit 358 at its output port, and transmits flit 360 to switch 310. Switch 310 consumes flit 360 at its input port, creates flit 362 corresponding to flit 360 at its output port, and transmits flit 362 to SANIC 306. SANIC 306 consumes flit 362 at its input port. Similarly, SANIC 308 transmits acknowledgment frame 346 b in flit 364 to switch 312. Switch 312 creates flit 366 corresponding to flit 364, and transmits flit 366 to switch 310. Switch 310 creates flit 368 corresponding to flit 366, and transmits flit 368 to SANIC 306.

Switch 312 acknowledges the receipt of flits 358 and 364 with acknowledgment flit 370, which is transmitted from switch 312 to SANIC 308. Switch 310 acknowledges the receipt of flits 360 and 366 with acknowledgment flit 372, which is transmitted to switch 312. SANIC 306 acknowledges the receipt of flits 362 and 368 with acknowledgment flit 374, which is transmitted to switch 310.

Architecture Layers and Implementation Overview

A host processor endnode and an I/O adapter endnode typically have quite different capabilities. For example, a host processor endnode might support four ports, hundreds to thousands of QPs, and allow incoming RDMA operations, while an attached I/O adapter endnode might only support one or two ports, tens of QPs, and not allow incoming RDMA operations. A low-end attached I/O adapter alternatively can employ software to handle much of the network and transport layer functionality which is performed in hardware (e.g., by SANIC hardware) at the host processor endnode.

One embodiment of a layered architecture for implementing the present invention is generally illustrated at 400 in diagram form in FIG. 11. The layered architecture diagram of FIG. 11 shows the various layers of data communication paths, and organization of data and control information passed between layers.

Host SANIC endnode layers are generally indicated at 402. The host SANIC endnode layers 402 include an upper layer protocol 404; a transport layer 406; a network layer 408; a link layer 410; and a physical layer 412.

Switch or router layers are generally indicated at 414. Switch or router layers 414 include a network layer 416; a link layer 418; and a physical layer 420.

I/O adapter endnode layers are generally indicated at 422. I/O adapter endnode layers 422 include an upper layer protocol 424; a transport layer 426; a network layer 428; a link layer 430; and a physical layer 432.

The layered architecture 400 generally follows an outline of a classical communication stack. The upper layer protocols employ verbs to create messages at the transport layers. The transport layers pass messages to the network layers. The network layers pass frames down to the link layers. The link layers pass flits through physical layers. The physical layers send bits or groups of bits to other physical layers. Similarly, the link layers pass flits to other link layers, and do not have visibility into how the physical layer bit transmission is actually accomplished. The network layers only handle frame routing, without visibility to segmentation and reassembly of frames into flits or transmission between link layers.

Bits or groups of bits are passed between physical layers via links 434. Links 434 can be implemented with printed circuit copper traces, copper cable, optical cable, or with other suitable links.

The upper layer protocol layers are applications or processes which employ the other layers for communicating between endnodes.

The transport layers provide end-to-end message movement. In one embodiment, the transport layers provide four types of transport services as described above, which are reliable connection service; reliable datagram service; unreliable datagram service; and raw datagram service.

The network layers perform frame routing through a subnet or multiple subnets to destination endnodes.

The link layers perform flow-controlled, error controlled, and prioritized frame delivery across links.

The physical layers perform technology-dependent bit transmission and reassembly into flits.

Congestion Management Architecture

Congestion Control Mechanisms

Congestion control mechanisms fall into three categories: congestion detection mechanisms; congestion reporting mechanisms; and congestion response mechanisms.

Congestion detection mechanisms cover the mechanisms used to detect congestion in the various network topologies the SAN fabric will support.

Congestion reporting mechanisms cover the mechanisms used to report the occurrence of congestion for short term use in alleviating congestion and for long term network management use (e.g., to allow a network management entity to analyze the network and recommend further actions to the system administrator).

Congestion response mechanisms cover the mechanisms used to alleviate or remove congestion from the various network topologies the SAN fabric will support.

SAN fabric congestion detection mechanisms are tailored for end points and switches and must be supported by the end points (e.g., hosts and I/O). In one embodiment, all types of switches (i.e., low to high end) support a given SAN fabric congestion detection mechanism. In another embodiment, the switch case is more flexible: high-end switches must support all the mechanisms, while low-end switches must support only the abnormal congestion detection. The problem with this second approach is that it is difficult to pin down the distinction between a low-end and a high-end switch, and as a result a high-end switch may not implement many of the congestion control mechanisms, which would defeat the purpose.

Congestion Detection Mechanisms

One embodiment of the present invention is directed to a congestion management architecture in distributed computer systems which provides for efficient congestion control implementations to alleviate congestion problems in the distributed computer system, such as distributed computer system 30 of FIG. 1.

Switch and Router Mechanisms

Queue depth watermarking: When queues in a switch reach a HighWaterMark amount of total queue capacity, begin to drop all frames that are marked droppable. When queues remain at the HighWaterMark for an AbnormalCongestionTimer period or no forward progress is made on any single switch send port, consider the condition Abnormal Congestion and begin to drop all frame types.

If switch queues are not very large, then the ratio between HighWaterMark and total queue capacity may be too small to handle droppable frames in a fair manner. For low-end SAN fabric switches, with small queues, a queue depth based congestion detection mechanism is not practical.
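The queue depth watermarking rule can be sketched as a small state machine. This is a hedged illustration: the class `WatermarkDetector`, its return labels, and the use of caller-supplied timestamps are assumptions, and the alternative trigger (no forward progress on a send port) is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatermarkDetector:
    """Hypothetical per-queue detector for the watermarking scheme."""
    capacity: int                      # total queue capacity, in frames
    high_water_mark: int               # HighWaterMark threshold, in frames
    abnormal_congestion_timer: float   # AbnormalCongestionTimer, in seconds
    _above_since: Optional[float] = None

    def classify(self, depth: int, now: float) -> str:
        if depth < self.high_water_mark:
            # Queue drained below the mark: reset the abnormal timer.
            self._above_since = None
            return "ok"
        if self._above_since is None:
            self._above_since = now
        if now - self._above_since >= self.abnormal_congestion_timer:
            # At or above the mark for a full timer period.
            return "abnormal"          # drop all frame types
        return "drop_droppable"        # drop only frames marked droppable
```

A switch with small queues has little room between `high_water_mark` and `capacity`, which is the fairness problem noted above.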

Time in queue: Timestamp all frames placed in the switch queue upon reception. If a frame is queued in the switch for longer than a (programmable) time period, it will be discarded. Another option is for the switch to use virtual lane (VL) credits for congestion detection and respond by discarding frames marked with the oldest timestamps.

Similar to queue depth watermarking, the time in queue approach assumes switch queues are relatively large, which for low-end SAN fabric switches is a poor assumption.
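The time in queue rule can be sketched with a FIFO of timestamped frames. The class `TimestampedQueue` and its method names are hypothetical, and frames are represented as plain strings for brevity.

```python
from collections import deque
from typing import Deque, List, Tuple


class TimestampedQueue:
    """Hypothetical switch queue that discards frames queued too long."""

    def __init__(self, max_time_in_queue: float) -> None:
        # The programmable time period after which a frame is discarded.
        self.max_time_in_queue = max_time_in_queue
        self._frames: Deque[Tuple[float, str]] = deque()

    def enqueue(self, frame: str, now: float) -> None:
        # Timestamp each frame upon reception.
        self._frames.append((now, frame))

    def discard_expired(self, now: float) -> List[str]:
        """Drop and return every frame queued longer than the limit."""
        dropped = []
        # The queue is FIFO, so the oldest timestamp is always at the head.
        while self._frames and now - self._frames[0][0] > self.max_time_in_queue:
            dropped.append(self._frames.popleft()[1])
        return dropped

    def pending(self) -> List[str]:
        return [frame for _, frame in self._frames]
```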

VL credit starvation: There are two components to this switch congestion detection process, sender starvation and receiver starvation; both must have occurred several times over a NormalCongestionTime period for the switch to be under Normal Congestion. Sender starvation occurs when the switch has accepted an incoming frame, but does not have space in the sending port's frame (retransmission) queue. Receiver starvation occurs when the switch detects VL credit starvation at a switch receive port. If both conditions occur simultaneously, the switch has detected Normal Congestion.

VL credit starvation can be used to detect congestion in switches that have small queues or large queues. The VL credit starvation approach described here must be supported by SAN fabric switches.

The switches must have two congestion detection timers: AbnormalCongestionTimer and NormalCongestionTimer.

The AbnormalCongestionTimer is used to detect a very long time period over which no forward progress has been made on any single switch receive/VL port. FN An architectural alternative would be to detect lack of forward progress at the receiver port by determining if any switch receiver/VL has gone a timer period without having any link credits available. Either approach works. Detecting lack of forward progress at a switch receiver port sounds backwards, but it detects lack of forward progress at the point where actions taken at the detection point can ease congestion in the fabric. The switch detects lack of forward progress at any single one of its receive ports by determining if any switch receive/VL port has gone an AbnormalCongestionTime period without having any link credits available. FN That is, the switch was not able to provide credits, on any single VL, to the nearest neighbor connected to the switch's receiver port. If so, the switch reports an AbnormalCongestionTime condition and responds with the Abnormal Congestion mechanisms described below.
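The receive-side check can be sketched by remembering, for each receive/VL port, when it last ran out of credits. The class `AbnormalCongestionMonitor`, the (port, VL) tuple key, and the unitless timestamps are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

PortVL = Tuple[int, int]  # (receive port number, virtual lane number)


class AbnormalCongestionMonitor:
    """Hypothetical tracker of credit starvation per receive/VL port."""

    def __init__(self, abnormal_congestion_time: float) -> None:
        self.abnormal_congestion_time = abnormal_congestion_time
        self._starved_since: Dict[PortVL, float] = {}

    def update(self, port_vl: PortVL, credits_available: int, now: float) -> None:
        if credits_available > 0:
            # Credits could be granted to the neighbor: progress was possible.
            self._starved_since.pop(port_vl, None)
        else:
            # Record only the start of an uninterrupted starvation period.
            self._starved_since.setdefault(port_vl, now)

    def abnormal_ports(self, now: float) -> List[PortVL]:
        """Ports that went a full AbnormalCongestionTime without credits."""
        return sorted(
            port_vl for port_vl, since in self._starved_since.items()
            if now - since >= self.abnormal_congestion_time
        )
```

A non-empty result from `abnormal_ports` corresponds to the condition under which the switch would report abnormal congestion.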

For Normal Congestion control, the switch uses a combination of receive/VL port credit and send/VL port output buffer starvation.

A switch detects congestion at a send/VL port when the switch has a frame available for the send/VL port, but the send/VL port has no output (e.g., frame retransmission buffer) space available to accept the frame. If this occurs a programmable number of SendPortCongestion times during a NormalCongestionTime period, then the send/VL port is considered to be under congestion. The SendPortCongestion time will have a default value of the flit round trip time between the switch send port and its nearest neighbor receiver port divided by the number of frames the switch output buffer can store.

However, this condition alone is not sufficient to differentiate between switch congestion and excessive flow queue depth, because it only detects congestion at the send/VL port (vs. a switch receive/VL-to-send/VL port flow).

A switch detects congestion at a receive/VL port when any single VL at the switch's receive port has no credits available (i.e., the switch has no VL credits available to send to the nearest neighbor attached to that receive/VL port). If this occurs a programmable number of ReceivePortCongestion times during a NormalCongestionTime period, then the receive/VL port is considered to be under congestion. The ReceivePortCongestion time will have a default value of the flit round trip time between the switch receive port and its nearest neighbor send port divided by the number of frames the switch output buffers can store.

If both congestion conditions occur a programmable maximum number of SwitchCongested times during a NormalCongestionTime period, then the switch is under Normal Congestion. The SwitchCongested value will have some default value (e.g., 5). In one embodiment, a methodology is used for setting the SwitchCongested default value based on switch utilization (e.g., the higher the switch is utilized, the lower the value).
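The combined test can be sketched as a counter of simultaneous send and receive starvation events inside a time window. This is a simplified reading, not the specified mechanism: the per-port SendPortCongestion and ReceivePortCongestion counts are collapsed into two booleans supplied by the caller, the NormalCongestionTime period is modeled as a sliding window, and the class name `NormalCongestionDetector` is hypothetical.

```python
from collections import deque
from typing import Deque


class NormalCongestionDetector:
    """Hypothetical detector combining send and receive port congestion."""

    def __init__(self, normal_congestion_time: float,
                 switch_congested: int = 5) -> None:
        self.normal_congestion_time = normal_congestion_time
        self.switch_congested = switch_congested  # example default of 5
        self._events: Deque[float] = deque()

    def observe(self, send_congested: bool, receive_congested: bool,
                now: float) -> bool:
        """Record one observation; return True under Normal Congestion."""
        # Forget coincidences older than one NormalCongestionTime period.
        while self._events and now - self._events[0] > self.normal_congestion_time:
            self._events.popleft()
        # Only the coincidence of both conditions counts toward the limit.
        if send_congested and receive_congested:
            self._events.append(now)
        return len(self._events) >= self.switch_congested
```

Send port congestion alone never trips the detector, which reflects the point that the send-side condition by itself cannot distinguish switch congestion from excessive flow queue depth.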

The default value for the abnormal congestion timer will be set to a (high) value (e.g., 100 ms). For example, 100 ms corresponds with 256 KB frames at 1 GB/s for the first 10 generation. That is, no forward progress was allowed on the switch receive ports for 256 4 KB frame cycles. An alternative is to set the default as scalable with link bandwidth: as the link bandwidth goes up, the default value goes down. But if the maximum frame size increases as well, then a fixed value can have the same cycle attributes. In one embodiment, the default value for the normal congestion timer will be set to 1/Nth of the abnormal congestion timer.

End Point Mechanisms

Explicit detection: End point congestion detection mechanisms are implemented at the end point receivers (i.e., destinations). Destination detection: Under this approach, the destination must detect Forward Explicit Congestion Notification (FECN) conditions forwarded at the flit level. The destination will forward the FECN to the source. The source will then make the injection rate adjustments. Source detection: Under this approach, the source must also detect FECN conditions forwarded at the flit level for Read RDMAs. The source will then make the injection rate adjustments.

Implicit detection: End point congestion detection mechanisms are implemented at the end point senders (i.e., sources).

A network can implement a few implicit congestion detection mechanisms, from the simple to the complex. One embodiment supports one (ACK time-out).

Frame to ACK cycle timing: Not recommended due to complexity and inability to function correctly when the network contains a mix of local and remote endpoints.

Under this approach, the injection rate (i.e., bytes per second) is adjusted by monitoring the previous injection rate and the cycle time of frames within the network. The cycle time calculation needs to be made on the basis of the round trip time between a frame and its corresponding ACK. The cycle time calculation cannot be made based on the time gap between ACKs, because the source may not always have frames to send and compensating for the frame sending time gap is not possible. If the source's frame injection rate is not continuous (i.e., the source's send rate has time gaps), then those time gaps need to be accounted for in a cycle time calculation that strictly looks at time gaps between ACKs. This compensation becomes very problematic. Suppose the source calculates the time delay caused by the congested switch stage by calculating the time gap between incoming frame ACKs. For example, the ACK for frame 1 was received at time A and the ACK for frame 2 was received at time B, so that time gap would be B-A. This approach would correctly reflect the time gap caused by the congested stage, so long as the source injection rate has no time gaps. However, if the source's frame injection rate also has a time gap, then the time gap would have to be compensated for by calculating the time gap between frame sends. For example, if frame sequence number 1 was sent at time X and frame sequence number 2 was sent at time Y, the time gap would be Y-X. Unfortunately, the frame injection time gap (Y-X) cannot be easily removed from the time gap caused by the congested stage (B-A), because the ((B-A)-(Y-X)) calculation would no longer just reflect the effect of the congested stage. The assumption of a gap-free injection rate is invalid for SAN traffic. The way this approach works is as follows.

The source monitors the number of outstanding requests over a source-destination/VL path; the number of bytes/second that the source is sending over the source-destination/VL path; and the time gap between each frame and its corresponding ACK.

The source calculates the frame cycle time by calculating the time delay between a frame send and its corresponding ACK or RNR_NAK received from the destination.

The source would then calculate the throughput as: Original frame size divided by the cycle time.

The source would then increase the injection rate until the throughput begins to decrease. When the throughput begins to decrease, the source would back up to the previous injection rate size.
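The frame to ACK cycle timing steps above can be sketched as a simple probe loop. This is an illustration only: the function names, the fixed step size, the upper bound `max_rate`, and the `measure_cycle_time` callback (which stands in for the source's real frame-send to ACK timing) are all assumptions.

```python
from typing import Callable


def throughput(frame_size_bytes: int, cycle_time_s: float) -> float:
    """Throughput as frame size divided by the frame-to-ACK cycle time."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return frame_size_bytes / cycle_time_s


def probe_injection_rate(measure_cycle_time: Callable[[float], float],
                         frame_size_bytes: int, start_rate: float,
                         step: float, max_rate: float) -> float:
    """Raise the injection rate until throughput drops, then back up."""
    prev_rate = start_rate
    prev_tp = throughput(frame_size_bytes, measure_cycle_time(start_rate))
    rate = start_rate + step
    while rate <= max_rate:
        tp = throughput(frame_size_bytes, measure_cycle_time(rate))
        if tp < prev_tp:
            # Throughput began to decrease: return the previous rate.
            return prev_rate
        prev_rate, prev_tp = rate, tp
        rate += step
    return prev_rate
```

Even in this reduced form the source must track a cycle time per frame and per path, which hints at the scheduler complexity discussed next.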

The main issue with this approach is the complexity it causes for the source's scheduler. It is believed that this complexity makes this approach unobtainable.

The problem with using a slightly simpler ACK gap time approach is that it doesn't compensate for source injection gaps (i.e., throughput variations at the source that are not caused by fabric congestion adjustments, but rather by source demand rate adjustments) and as a result it doesn't perform its intended function. A second, perhaps more important problem with ACK gap timing is that when a given source has flows with more than one minimum bandwidth, under congestion the higher bandwidth flow will have the same ACK gap timing as the lower bandwidth flows. As a result, the source will lower the link injection rate of all flows instead of isolating the flows that are congested.

Performing ACK time-outs: under this approach, the injection rate (i.e., bytes/second outstanding) is adjusted by monitoring ACK time-outs. The injection rate can be lowered at various levels: message, frame, flit, or bytes/second. Given the wide range of frame sizes in a local fabric (e.g., from a 32 byte request to a 4 GB sequential disk write), for this example the SAN fabric injection rate is in bytes/second. When an ACK time-out occurs, the source assumes the ACK time-out occurred due to congestion, that is, that a stage in the network has dropped the frame due to congestion. When an ACK time-out occurs, the source will reduce the injection rate by half and resume transmission from the last frame expected. The source will then wait a fixed WANCongestionCleared time period before increasing the window size. After the WANCongestionCleared time period has elapsed, the source would increase the window size linearly.

Of these two implicit congestion detection approaches, ACK time-outs seem less complex for the source's scheduler. Implicit congestion detection is required to support SAN Fabric over LAN/WAN fabrics. The proposal would be to use implicit congestion detection based on ACK time-outs for paths that include non-SAN fabrics, as follows.

The source's Transport level ERP would detect the ACK time-out. The source's scheduler would cut the injection rate for the affected path (source-destination/VL) by half. The source would then begin retransmission of the affected queue pair starting at the next expected frame. In one embodiment, the source would wait a WANCongestionCleared time period before increasing the injection rate. When the WANCongestionCleared timer pops, the source would increase the injection rate linearly.

In an alternative embodiment, the source would wait to receive ACKs for a programmable number of frames, WANUncongested. If WANUncongested frames get ACK'd, then the SAN Fabric WAN traffic is no longer under congestion, so the source increases the injection rate linearly.
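A minimal sketch of the timer-based embodiment follows. The class name, the `linear_step` increment, the `max_rate` cap, and the idea of calling `on_tick` once per adjustment interval are all assumptions made for illustration; the text only specifies halving on an ACK time-out, waiting for WANCongestionCleared, and then increasing linearly.

```python
class AckTimeoutRateControl:
    """Hypothetical per-path response to ACK time-outs."""

    def __init__(self, rate: float, linear_step: float,
                 cleared_period: float, max_rate: float) -> None:
        self.rate = rate                      # injection rate, bytes/second
        self.linear_step = linear_step        # assumed additive increase per tick
        self.cleared_period = cleared_period  # WANCongestionCleared, seconds
        self.max_rate = max_rate              # assumed upper bound on the rate
        self.hold_until = 0.0                 # no increases before this time

    def on_ack_timeout(self, now: float) -> float:
        """Assume congestion: cut the rate by half and start the hold period."""
        self.rate /= 2
        self.hold_until = now + self.cleared_period
        return self.rate

    def on_tick(self, now: float) -> float:
        """Called once per adjustment interval; increase linearly after the hold."""
        if now >= self.hold_until:
            self.rate = min(self.rate + self.linear_step, self.max_rate)
        return self.rate


ctl = AckTimeoutRateControl(rate=800.0, linear_step=50.0,
                            cleared_period=2.0, max_rate=800.0)
after_timeout = ctl.on_ack_timeout(now=10.0)
during_hold = ctl.on_tick(now=11.0)
first_increase = ctl.on_tick(now=12.0)
second_increase = ctl.on_tick(now=13.0)
```

The alternative embodiment could be sketched the same way by replacing the time comparison with a count of ACK'd frames compared against WANUncongested.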

Congestion Reporting Mechanism

The forward explicit congestion notification (FECN) is architected into the flit and frame layers of the fabric. For Send and Write RDMA frames, FECN is detected at the flit layer and reported at the frame layer. For FetchOp and Read RDMA frames, FECN is detected at the flit layer and reported at the flit and frame layers: frames from source to destination (e.g., a Read RDMA request) will get reported at the frame layer, and frames from destination to source (e.g., Read RDMA data) will get reported at the flit layer. For ACK/NAK frames, FECN is detected at the flit layer, but the end-point will discard it. (FN: Alternatively, an end-point may discriminate between ACK/NAKs received in response to a Send/Write-RDMA frame and those received in response to a FetchOp/Read-RDMA frame, and not adjust the injection rate for ACK/NAKs with a non-zero FECNCount received in response to a FetchOp/Read-RDMA frame.)

Switch Mechanisms

In one embodiment, flits have 4 bits to carry FECN. These 4 bits are called the FECNCount and are contained in the flit delimiter. The source must set the FECNCount to zero. SAN fabric switches will increment the FECNCount if the switch is under congestion, until the FECNCount reaches the maximum value (15). When the FECNCount is equal to 15, switches will not increment it, because it is already at its maximum. Each switch stage is responsible for maintaining the flit level FECN notification as it goes across the switch's internal receiver to sender path. This can be done by carrying the flit type fields around, or simply carrying a bit around.

For a given flow (source-destination/VL), the FECNCount accumulates the number of switches that are under congestion.
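The saturating behavior of the 4-bit FECNCount can be illustrated with a short sketch. The function names and the representation of a path as a list of per-switch congestion flags are assumptions for illustration only.

```python
from typing import Iterable

FECN_COUNT_MAX = 15  # largest value that fits in the 4-bit FECNCount field


def switch_forward(fecn_count: int, switch_congested: bool) -> int:
    """Return the FECNCount a switch places on a flit it forwards."""
    if switch_congested and fecn_count < FECN_COUNT_MAX:
        return fecn_count + 1   # congested switch increments, up to the maximum
    return fecn_count           # uncongested, or already saturated at 15


def path_fecn_count(congested_flags: Iterable[bool]) -> int:
    """FECNCount seen at the end of a path; the source starts it at zero."""
    count = 0
    for congested in congested_flags:
        count = switch_forward(count, congested)
    return count


two_of_three = path_fecn_count([True, False, True])
saturated = path_fecn_count([True] * 20)
```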

Router Mechanisms

In one embodiment, flits have 4 bits to carry FECN. These 4 bits are called the FECNCount and are contained in the flit delimiter. The source must set the FECNCount to zero. SAN fabric switches will increment the FECNCount if the switch is under congestion, until the FECNCount reaches the maximum value (15). When the FECNCount is equal to 15, switches will not increment it, because it is already at its maximum. Each switch stage is responsible for maintaining the flit level FECN notification as it goes across the switch's internal receiver to sender path. This can be done by carrying the flit type fields around, or simply carrying a bit around.

SAN Fabric to non-SAN Fabric routers are not responsible for propagating the FECNCount fields across the non-SAN fabric. However, they are responsible for sending a frame level Backward Explicit Congestion Notification (BECN) frame containing the FECNCount to the source of the flit that experienced congestion. That is, if a router receives a flit with a non-zero FECNCount, the router is responsible for:

Generating a No-Op frame with the FECNCount field equal to the highest FECNCount in the flit delimiters of the outbound frame.

Sending the No-Op frame to the source of the outbound frame that experienced congestion. The No-Op frame that is sent must be ACK'd by the source (otherwise, the source may not get it if the intermediate switches discard unACK'd frames).

End Point Mechanisms

The destination's link layer is responsible for detecting the flit level FECN notification and passing the FECN to the destination's transport layer. Regardless of the frame's error state (i.e., whether the destination will ACK or NAK the frame), the destination's transport layer is responsible for reporting the FECN back to the source for reliable service classes. The destination will set the FECNCount field in the outbound ACK/NAK frame to the highest FECNCount received in the flit delimiters associated with the inbound frame.
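The reporting rule at the destination reduces to taking a maximum over the flits of one inbound frame. The sketch below assumes a frame's flit delimiters are available as a list of integers; the function name is hypothetical.

```python
from typing import Sequence


def ack_fecn_count(flit_fecn_counts: Sequence[int]) -> int:
    """FECNCount to place in the outbound ACK/NAK for one inbound frame.

    The value is the highest FECNCount among the frame's flit delimiters.
    Returning 0 for a frame with no flits is an assumption of this sketch.
    """
    return max(flit_fecn_counts, default=0)


reported = ack_fecn_count([0, 2, 1, 0])
```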

Congestion Response Mechanism

End Point Mechanisms

The source's scheduler should be contained in hardware for SAN Fabric traffic over SAN fabrics, otherwise the major benefits of SAN Fabric can be lost. A design issue is how much additional complexity the dynamic adjustment adds to the source's scheduler.

The source's scheduler has the ability to lower the max QP injection rate based on the reception of an ACK or NAK with a non-zero FECNCount from the destination. There are several options for lowering the max QP injection rate based on FECN.

One standard approach is to maintain two counters per QP: FECN0 and FECN1. FECN0 counts the number of ACK/NAKs received with a zero FECNCount. FECN1 accumulates the FECNCount(s) received from ACK/NAKs. The counts are accumulated over a time period FECN_Time of 4× the static end-to-end RTT. If FECN1>=FECN0 over FECN_Time, then set the max QP injection rate to half the previous max QP injection rate (other fractional values can be used, such as 0.875). Otherwise, set the max QP injection rate to twice the previous max QP injection rate. (FN: two more bits and one more timer can be implemented to dampen and settle down the injection rate oscillations.) This basically uses aggressive QP injection rate acceleration, which can cause larger fluctuations in traffic, but also aggressively removes congestion. For a SAN, where the large fluctuations may impact performance, a more reasonable approach seems to be to modify the max QP injection rate more gradually, say by reducing the max QP injection rate to 85% of its previous value under congestion and increasing it to 115% of its previous value (a factor of 1.15) when congestion subsides.

This approach requires the following state per VL at the source's scheduler:

-   FECN0 accumulates the number of frames with no congestion.
-   FECN1 accumulates the FECN count.
-   FECN_Time counts down to zero. When it pops, FECN0 and FECN1 are compared.
-   Increment Injection Rate, when set, is used to increment the injection rate.
-   Decrement Injection Rate, when set, is used to decrement the injection rate.
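The two-counter scheme can be sketched as follows. This is one reading of the text, not a definitive implementation: it assumes the rate is multiplied by a decrease factor when FECN1>=FECN0 and by an increase factor otherwise, and it additionally assumes (not stated in the text) that a window with no ACK/NAKs at all leaves the rate unchanged. The class and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FecnWindow:
    """Hypothetical scheduler state for one FECN_Time window."""
    max_rate: float          # max injection rate, bytes/second
    decrease: float = 0.5    # factor applied under congestion (e.g., 0.5 or 0.85)
    increase: float = 2.0    # factor applied otherwise (e.g., 2.0 or 1.15)
    fecn0: int = 0           # ACK/NAKs received with a zero FECNCount
    fecn1: int = 0           # accumulated non-zero FECNCount values

    def on_ack(self, fecn_count: int) -> None:
        """Account for one received ACK/NAK."""
        if fecn_count == 0:
            self.fecn0 += 1
        else:
            self.fecn1 += fecn_count

    def on_fecn_time_pop(self) -> float:
        """Compare the counters when FECN_Time pops and adjust the rate."""
        if self.fecn0 == 0 and self.fecn1 == 0:
            return self.max_rate            # assumption: idle window, no change
        if self.fecn1 >= self.fecn0:
            self.max_rate *= self.decrease  # congestion indications dominate
        else:
            self.max_rate *= self.increase  # congestion has subsided
        self.fecn0 = self.fecn1 = 0         # start a fresh window
        return self.max_rate


w = FecnWindow(max_rate=1000.0)
w.on_ack(0)
w.on_ack(3)
halved = w.on_fecn_time_pop()      # FECN1=3 >= FECN0=1
for count in (0, 0, 1):
    w.on_ack(count)
doubled = w.on_fecn_time_pop()     # FECN1=1 < FECN0=2
```

Passing `decrease=0.85, increase=1.15` gives the gentler adjustment that the text suggests for a SAN.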

A second approach is to start a timer upon the first reception of an ACK/NAK with a non-zero FECNCount from the destination and then accumulate the number of FECNCounts over a period of time starting with the time of the first FECN and ending with a FECN_TIMER_POP. If the total number of FECNCounts collected over the time period is greater than a variable percentage (e.g., half) of the number of outstanding frames during that same time period, then reduce the injection rate. Otherwise, treat the condition as slight congestion and don't change the injection rate.
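A sketch of this second approach is shown below. The class name, the explicit `now` arguments, and the way the number of outstanding frames is supplied at the timer pop are assumptions for illustration.

```python
from typing import Optional


class FecnTimerWindow:
    """Hypothetical state for the timer-started FECN accumulation."""

    def __init__(self, period: float, fraction: float = 0.5) -> None:
        self.period = period                     # time from first FECN to FECN_TIMER_POP
        self.fraction = fraction                 # variable percentage, e.g. half
        self.deadline: Optional[float] = None    # None while no timer is running
        self.total = 0                           # FECNCounts accumulated so far

    def on_ack(self, now: float, fecn_count: int) -> None:
        """Account for one ACK/NAK; the first non-zero FECNCount starts the timer."""
        if fecn_count > 0:
            if self.deadline is None:
                self.deadline = now + self.period
            self.total += fecn_count

    def on_timer_pop(self, outstanding_frames: int) -> bool:
        """Return True if the injection rate should be reduced."""
        reduce = self.total > self.fraction * outstanding_frames
        self.deadline = None                     # wait for the next first FECN
        self.total = 0
        return reduce


t = FecnTimerWindow(period=1.0)
t.on_ack(now=0.0, fecn_count=0)        # no congestion reported, timer not started
not_started = t.deadline is None
t.on_ack(now=0.5, fecn_count=2)        # first FECN starts the timer
t.on_ack(now=0.7, fecn_count=4)
heavy = t.on_timer_pop(outstanding_frames=10)    # 6 > 5: reduce
t.on_ack(now=2.0, fecn_count=1)
slight = t.on_timer_pop(outstanding_frames=10)   # 1 <= 5: leave the rate alone
```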

If a link has been idle for a long time, then set the max QP injection rate to half the previous maximum QP injection rate and increase the Injection Rate (IR) using the slow start algorithm: IR(i+1)=IR(i)*2, where IR(i) is a rate measurement of the bytes/second that were ACK'd back from the destination.
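The slow start rule can be illustrated with a short sketch that lists the successive injection rates. For simplicity the sketch treats IR(i) as the modelled rate rather than a measurement of ACK'd bytes/second, and it assumes the doubling stops at the new maximum; the function name and the `initial_rate` parameter are hypothetical.

```python
from typing import List


def slow_start_rates(prev_max_rate: float, initial_rate: float) -> List[float]:
    """Successive injection rates after a long idle period.

    The new maximum is half the previous maximum, and the rate doubles
    each step, IR(i+1) = IR(i) * 2, until it reaches that maximum.
    """
    if initial_rate <= 0:
        raise ValueError("initial_rate must be positive")
    cap = prev_max_rate / 2            # new max QP injection rate
    rate = min(initial_rate, cap)
    rates = [rate]
    while rate < cap:
        rate = min(rate * 2, cap)      # multiplicative increase, clamped at the cap
        rates.append(rate)
    return rates


ramp = slow_start_rates(prev_max_rate=1600.0, initial_rate=100.0)
```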

The QP injection rate is on a bytes/second basis.

Switch Mechanisms

When the NormalCongestionTime pops, the switch will enter NC-State. When in this state, the switch will drop all frames that are marked droppable (i.e., unreliable datagram and raw frames). All frames received on receive ports marked droppable will be dropped. The switch will make the link credits that are freed from this process available to the switch's nearest neighbor using centralized weighted fairness.

The switch will continue to drop frames marked droppable for a time period of 2× the NormalCongestionTimer. This provides weighted fairness (a NormalCongestionTime period) for droppable frames. The switch will then reset the NC-State and restart the NormalCongestionTime timer.

When the AbnormalCongestionTime pops, the switch will drop all frames and consider the situation a permanent error, meaning the condition has not gone away over a long period and for all intents and purposes it is due to a permanent error (e.g., a dead link, a dead destination, or a broken destination (i.e., no receive WQEs ever)). In any case, the source will detect an ACK time-out and will respond according to the policies set in the next sections.

Congestion Behavior

In one embodiment, SAN Fabric components implement levers (mechanisms) appropriate to the component type. In one embodiment, SAN Fabric switches implement a weighted fairness queuing algorithm that prevents receiver starvation. Some levers will be set to a fixed value. Some levers will be variable and set by an algorithm defined in the SAN Fabric specification.

EXAMPLE CONGESTION MANAGEMENT POLICIES

Example 1 No Drops within SAN Fabric, Drops when Non-SAN Fabric

SAN Fabric Transports Solely Over a SAN Fabric Network

Link level back pressure is used. This means there are not any lost frames due to congestion.

The NormalCongestionTimer is set equal to the AbnormalCongestionTimer.

Each QP uses the minimum number of outstanding requests to achieve the desired BW for the necessary distance to the destination (i.e., once maximum BW is achieved, a larger window size can only increase congestion).

Each QP should inject frames at no higher than the maximum BW of the slowest link in use. This is important if there are multiple speed SAN Fabric links in use.

The request-response timer should be set high enough that a time-out implies a frame is lost due to an error (as opposed to congestion).

Legacy Protocols Solely Over a SAN Fabric Network

Legacy protocols (e.g., TCP) are sent as Raw Datagram frames. These frames do not have a SAN Fabric acknowledgment frame. All acknowledgments occur at the legacy ULP level.

Link level back pressure means there aren't any lost frames due to congestion.

Frames are injected into the network at some maximum injection rate (e.g., specified in MB/s or frames/second). This maximum rate is based on a QoS parameter and on the minimum speed SAN Fabric link in the path between source and destination.

The ULP SW reduces the window size if ULP acknowledgments do not return within a certain time period. In addition to reducing window size, the ULP SW may choose to reduce the injection rate of frames into the SAN Fabric network (i.e., scale back the BW transmitted).

The SW stack controlling the QP should be able to easily set the maximum injection rate, e.g., with a WQE or as part of the Post Send verb.

Dropped frames (e.g., due to bit errors or other improbable occurrences) are handled by the ULP and not by the SANIC driver.

SAN Fabric Transport Over Both a SAN Fabric Network and WAN Network(s)

Link level back pressure is used within the SAN Fabric network.

Should the WAN drop frames due to congestion, the frame ACK timer will expire and invoke the implicit congestion response mechanism.

Legacy Protocols Over a Mix of SAN Fabric and WAN Networks

As above, the maximum BW injected into the SAN Fabric network should be less than or equal to the maximum BW of the WAN, and link level back pressure is used within the SAN Fabric network. The WAN is assumed to be slower than the SAN Fabric.

Should the WAN drop frames due to congestion, the legacy ULP will time out and notice it hasn't received an acknowledgment. It will retransmit frames with a smaller window size and/or with a lower rate of injection into the SAN Fabric network.

The legacy ULP driver will, upon receiving acknowledgments, increase its window size and/or increase its rate of injection into the SAN Fabric.

Example 2 Drop Datagram and Raw Frames Under Normal Congestion within SAN Fabric, Drops when Non-SAN Fabric

SAN Fabric Transports Solely Over a SAN Fabric Network

Link level back pressure is used. This means there aren't any lost frames due to congestion.

NormalCongestionTimer is set to 1/Nth the value of the AbnormalCongestionTimer.

All of the normal congestion detection, reporting and response mechanisms are implemented, summarized below for completeness:

Detection:

-   Switch: Detects congestion by analyzing receive and send port resources as stated earlier.
-   Source: Detects congestion reported by analyzing the FECNCount field in the frame transport header as stated earlier.
-   Destination: Detects congestion reported by analyzing the FECNCount field in the frame transport header as stated earlier.

Reporting:

-   Switch: Propagates the FECNCount field in the flit delimiters as stated earlier.
-   Routers: When a flit has a non-zero FECNCount field, sends a No-Op frame to the flit source with the FECNCount field equal to the highest FECNCount of the flits associated with the frame.
-   Destination: Sets the ACK/NAK FECNCount field equal to the highest FECNCount of the flits associated with the frame.

Response:

-   Switch: Drops frames when NormalCongestion is encountered as stated earlier.
-   Source: Lowers injection rate based on FECNCount as described earlier.

Legacy Protocols Solely Over a SAN Fabric Network

Legacy protocols (e.g., TCP) are sent as Raw Datagram frames. These frames do not have a SAN Fabric acknowledgment frame. All acknowledgments occur at the legacy ULP level.

By setting the NormalCongestionTimer lower than the AbnormalCongestionTimer, frames will be lost due to normal congestion.

Frame loss will invoke the legacy protocol's injection rate or window size reduction algorithms.

SAN Fabric Transport Over Both a SAN Fabric and WAN Network(s)

Link level back pressure is used within the SAN Fabric network.

Should the WAN drop frames due to congestion, the frame ACK timer will expire and invoke the implicit congestion response mechanism.

Legacy Protocols Over a Mix of SAN Fabric and WAN Networks

Should the WAN drop frames due to congestion, the legacy ULP will time out and notice it hasn't received an acknowledgment. It will retransmit frames with a smaller window size and/or with a lower rate of injection into the SAN Fabric network.

The legacy ULP driver will, upon receiving acknowledgments, increase its window size and/or increase its rate of injection into the SAN Fabric network.

Example 3 Drop Frames Under Normal Congestion within SAN Fabric, Drops when Non-SAN Fabric

SAN Fabric Transports Solely Over a SAN Fabric Network

Link level back pressure is effectively used strictly for short lived flow control between link segments.

AbnormalCongestionTimer is set to a very low value (e.g., 10s of frames vs. 100s or 1000s).

Frames will be lost under moderate congestion and invoke the implicit congestion detection, reporting and response mechanism.

Legacy Protocols Solely Over a SAN Fabric

Legacy protocols (e.g., TCP) are sent as Raw Datagram frames. These frames do not have a SAN Fabric acknowledgment frame. All acknowledgments occur at the legacy ULP level.

Frames will be lost due to moderate congestion.

Frame loss will invoke the legacy protocol's injection rate or window size reduction algorithms.

SAN Fabric Transport Over Both a SAN Fabric Network and WAN Network(s)

Link level back pressure is effectively used strictly for short lived flow control between link segments.

Should the WAN drop frames due to congestion, the frame ACK timer will expire and invoke the implicit congestion response mechanism defined in section 18.10.1.3.

Legacy Protocols Over a Mix of SAN Fabric and WAN Networks

Should the WAN drop frames due to congestion, the legacy ULP will time out and notice it hasn't received an acknowledgment. It will retransmit frames with a smaller window size and/or with a lower rate of injection into the SAN Fabric network.

The legacy ULP driver will, upon receiving acknowledgments, increase its window size and/or increase its rate of injection into the SAN Fabric network.

Congestion Scenarios in Example Topologies

Scenario 1: Singleton Host Tree with Adapter Leaves.

A simple tree configuration is generally illustrated at 500 in FIG. 12. This simple tree configuration may cause severe head of line (HOL) blocking problems in switch A for adapters A and D. Whether switch A experiences these severe problems or not depends on the host's scheduling algorithm and switch A's congestion control algorithm.

For example, suppose host A's scheduler doesn't provide weighted fair queuing that compensates for link bandwidth differences (i.e., the host scheduler uses round robin selection for all traffic on the same VL and weights traffic for higher priority VLs higher than traffic for low-priority VLs, but does nothing more). Weighted fair queuing that compensates for link bandwidth differences means the host scheduler would use round robin selection for all traffic on the same VL, would weight traffic for higher priority VLs higher than traffic for low-priority VLs, and would also weight traffic with the highest minimum path bandwidth higher than traffic with the lowest minimum path bandwidth. Without that compensation, when the host has multiple frames to send to adapter B or C (or adapters B and C request multiple Read Remote DMA frames from host A), the host can cause long periods of head of line blocking by consuming switch A queue resources. Switch A will free queue resources at the link rate of adapters B and C. As a result, the host will experience periods where no virtual lane credits are available for transfers to adapters A and D.

Several congestion control mechanisms were considered for sources, including: link level back-pressure, implicit congestion control based on Frame-ACK timing and explicit congestion control based on FECN. Of the several forms of congestion control mechanisms considered, SAN Fabric sources must implement the explicit congestion control approach. The following describes how explicit congestion control works under scenario 1. It will also describe the difficulties with the implicit congestion control approach that was considered.

Use explicit congestion detection by means of FECN back to the source and use slow start with multiplicative decrease.

Under this approach, when head of line blocking at switch A occurs, the switch detects congestion and then marks flits with a FECN on just the send ports that have detected Normal Congestion. To be clear, the switch congestion detection process described in this chapter has two components: sender starvation and receiver starvation. Both must have occurred several times over a NormalCongestionTime period for the switch to be under Normal Congestion. Sender starvation: A switch detects a lack of credit at a send port when the switch has a frame queued for the send port, but has no credits available to send data through that send port. If this occurs N times during a NormalCongestionPeriod, then the send port is under congestion. However, this condition alone is insufficient to differentiate between switch congestion and excessive flow queue depths. Receiver starvation: For the switch to determine it is under congestion, the switch also has to determine whether it has been unable to send credits to any one of its neighbors M times during a NormalCongestionPeriod. If both conditions apply, N occurrences of being out of credits at any send port and M occurrences of being out of credits at any receive port, then the switch is under NormalCongestion. If intermediate switches were included in scenario 1, they would need to pass the accumulated FECNs through to the next stage in the network. In scenario 1's configuration, assuming the flows are a result of long lived workload patterns, links 3 and 4 will get a FECN before links 2 and 5. As a result, adapters B and C will report the FECN back to the source in their ACKs, but adapters A and D will not. The host will adjust the injection rates for adapters B and C when they are the cause of congestion.
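The two-part detection rule can be sketched as a per-period counter. This sketch assumes sender starvation is counted per send port while receiver starvation is counted for the switch as a whole, which is one reading of the text; the class and method names are hypothetical.

```python
from collections import Counter
from typing import Set


class NormalCongestionDetector:
    """Hypothetical switch state for one NormalCongestionTime period."""

    def __init__(self, n: int, m: int) -> None:
        self.n = n                       # sender starvation threshold (N)
        self.m = m                       # receiver starvation threshold (M)
        self.send_blocked: Counter = Counter()
        self.credit_blocked = 0

    def on_send_blocked(self, send_port: int) -> None:
        """A frame was queued for send_port but no credits were available."""
        self.send_blocked[send_port] += 1

    def on_credit_blocked(self) -> None:
        """The switch could not send credits to one of its neighbors."""
        self.credit_blocked += 1

    def on_period_end(self) -> Set[int]:
        """Return the send ports whose flits should be marked with a FECN."""
        ports: Set[int] = set()
        if self.credit_blocked >= self.m:
            # Both conditions are required: receiver starvation for the
            # switch and N or more starvation events at the send port.
            ports = {p for p, c in self.send_blocked.items() if c >= self.n}
        self.send_blocked.clear()        # counters restart every period
        self.credit_blocked = 0
        return ports


d = NormalCongestionDetector(n=2, m=1)
d.on_send_blocked(3)
d.on_send_blocked(3)
sender_only = d.on_period_end()          # no receiver starvation this period

d.on_send_blocked(3)
d.on_send_blocked(3)
d.on_send_blocked(2)
d.on_credit_blocked()
both = d.on_period_end()                 # port 3 meets N, port 2 does not
```

Requiring both conditions is what lets the switch distinguish its own congestion from merely deep flow queues at a send port.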

Use implicit congestion detection by means of Frame-to-ACK timing and use slow-start and multiplicative decrease to respond to congestion. (FN: This is a derivative of TCP Vegas.)

Under this approach, when head of line blocking at switch A occurs, the Frame-ACK timing for adapters B and C will appear to be the same as the Frame-ACK timing for adapters A and D. That is, assuming all switch A flows attempt to fully compete for link 1, then all switch A flows will get their injection rates reduced at the host, not just link 1->3 and link 1->4 flows. (This is especially true if Host-AdapterB and Host-AdapterC flows are long and occur before Host-AdapterA and Host-AdapterD flows begin.)

Once the injection rates have been reduced, and congestion has subsided, all flows will again attempt to increase their injection rates. Assuming the flows are a result of relatively long lived workload patterns, two cases need to be treated: A) all flows set their injection rate increase time interval to a constant; and B) all flows set their injection rate increase time interval based on a function of the Frame-ACK timing during uncongested operations. The Frame-ACK timing will be set to a different value depending on the flow. For example, in the scenario 1 configuration, link 1->2 flows will have a much lower Frame-ACK value than link 1->3 flows.

If all flows attempt to increase their injection rates at the same constant time interval, then all flows will find the same conditions are still in effect and the applied load will continue to operate in the middle (lower portion) of the uncongested region. The reason is that link 3 and link 4 will continue to cause HOL blocking as long as all flows increase their injection rates simultaneously. This causes the network throughput to operate at a sub-optimal point in the uncongested region.

However, if all flows attempt to increase their injection rates based on a function of the Frame-ACK timing during uncongested operations, then flows with a higher minimum path bandwidth will increase their injection rates at a faster rate than the flows with lower minimum path bandwidths. In this case, link 1->3 and link 1->4 flows will attempt to increase their injection rates more slowly (longer time period between injection rate increases) than link 1->2 and link 1->5 flows (which use a shorter time period between injection rate increases).

Just using link level back-pressure alone by reducing the number of credits available to the host is not very efficient, because the host cannot determine which flows are under end-to-end back pressure and which flows are not. Again, this will cause all switch A flows to operate at a sub-optimal point in the uncongested region.

If host A's scheduler provides weighted fair schedule queuing that compensates for only static link bandwidth differences, then host A will adjust the injection rate so as to not exceed the lowest link bandwidth rate. For example, the injection rate for the host A to adapter B flow would be set to a maximum of the low bandwidth rate; and the injection rate for the host A to adapter A flow would be set to a maximum of the high bandwidth rate. This approach would work fine, as long as the configuration is kept to a singleton host tree with no peer-to-peer adapter transfers and no routers. However, scenarios 2 and 3 will describe how static flow control is insufficient for a singleton host tree that contains routers or adapters performing peer-to-peer operations.

The main points are as follows:

For a simple tree network, with no peer-to-peer transfers and no routers into the internet, dynamic injection rate control using either of the two methods described above will keep the network operating near the optimal point of the uncongested region on average, with intermediate periods of normal congestion.

For a simple tree network, with no peer-to-peer transfers and no routers into the internet, static injection rate control (i.e., host A's scheduler provides weighted fair schedule queuing that compensates for link bandwidth differences) is also effective at keeping network operation near the optimal point in the uncongested region. However, the next two scenarios will describe why static injection control alone is not effective at keeping network operation near the optimal point, if this simple singleton host network includes peer-to-peer transfers and routers into the internet.

Scenario 2: Singleton Host Tree with Peer-to-Peer Adapter Leaves

This scenario simply adds peer-to-peer adapter transfers to the configuration depicted in scenario 1.

Again, several congestion control mechanisms were considered for sources, including: link level back-pressure, implicit congestion control based on Frame-ACK timing and explicit congestion control based on FECN. Of the several forms of congestion control mechanisms considered, SAN Fabric sources must implement the explicit congestion control approach. The following describes how explicit congestion control works under scenario 2. It will also describe the difficulties with the implicit congestion control approach that was considered.

Use explicit congestion detection by means of FECN back to the source and use slow start with multiplicative decrease.

Under this approach, when HOL blocking at switch A occurs, the switch detects congestion and marks flits with a FECN on just the send ports that have detected Normal Congestion. Assuming the flows are a result of long lived workload patterns, the links that are responsible for the congestion will get a FECN before links that are not responsible for the congestion. As a result, the flows that are responsible for congestion will get their injection rates reduced before those that are not. For example, if host A and adapter A both attempt to fully utilize link 5 by attempting to consume link 5's full bandwidth during transfers to adapter E, then both host A and adapter A will lower their injection rates and recover from the congestion.

Use implicit congestion detection by means of Frame-to-ACK timing and use slow-start and multiplicative decrease to respond to congestion.

Under this approach, when HOL blocking at switch A occurs, host A's Frame-ACK timing for link 1->3 and link 1->4 flows will appear to be the same as the Frame-ACK timing for the link 1->2 and link 1->5 flows. Similarly, adapter A's Frame-ACK timing for link 3->1 and link 3->5 will appear the same. If host A and adapter A each set their injection rate increase time interval based on a function of the Frame-ACK timing during uncongested operations, then Normal Congestion problems will be quickly detected and recovered from, allowing the network to operate in the uncongested region.

But now the main problem with Frame-ACK timing based (e.g., TCP Vegas style) dynamic injection rate control surfaces: fairness. There is an enhanced TCP Vegas style injection rate control algorithm that is claimed to improve fairness significantly, but at the cost of greater instability. This enhanced algorithm should be analyzed for applicability. If host A is consuming the full bandwidth available on link 5, and adapter A begins to also transfer data over link 5, then soon host A and adapter A will get their injection rates lowered. If host A was operating at a higher rate than adapter A, then it will get a larger share of link 5's bandwidth.

Just using link level back-pressure alone by reducing the number of credits available to the host is not very efficient, because the host cannot determine which flows are under end-to-end back-pressure and which flows are not. Again, this will cause all switch A flows to operate at a sub-optimal point in the uncongested region.

If host A's and adapter A's schedulers provide weighted fair schedule queuing that compensates for only static link bandwidth differences, then host A and adapter A will not adjust their injection rates when their flows conflict and cause normal congestion.

Scenario 3: Singleton Host Tree with Adapter and Router Leaves.

A second simple tree configuration is generally illustrated at 600 in FIG. 13 to illustrate scenario 3. As illustrated in FIG. 13, scenario 3 replaces adapter B in the configuration depicted in scenario 1 with a router (B).

Again, several congestion control mechanisms were considered for sources, including: link level back-pressure, implicit congestion control based on Frame-ACK timing and explicit congestion control based on FECN. Of the several forms of congestion control mechanisms considered, SAN Fabric sources must implement the explicit congestion control approach. The following describes how explicit congestion control works under scenario 3. It will also describe the difficulties with the implicit congestion control approach that was considered.

Use explicit congestion detection by means of FECN back to the source and use slow start with multiplicative decrease.

Under this approach, when HOL blocking at switch A occurs, the switch detects congestion and marks flits with a FECN on just the send ports that have detected Normal Congestion. Assuming the flows are a result of long lived workload patterns, the links that are responsible for the congestion will get a FECN before links that are not responsible for the congestion. As a result, the flows that are responsible for congestion will get their injection rates reduced before those that are not. For example, if congestion occurs at router B due to a high send rate from host A, then the switch will forward a FECN to router B. Router B will return the FECN to host A through a No-Op frame. Host A will lower its injection rates and the local fabric will recover from the congestion.

Use implicit congestion detection by means of Frame-to-ACK timing and use slow-start and multiplicative decrease to respond to congestion.

Under this approach, if router B becomes congested it will quickly (through the link level back pressure) cause switch A to become congested.

Static injection rate control. If router B is part of a private network that is well managed, such that host A can determine all SAN Fabric and non-SAN Fabric link bandwidths per flow, then host A can adjust the injection rate so as to not exceed the lowest link bandwidth in use for each flow over the private network. This approach requires: a tight private network topology; and the ability for management software to extract the lowest bandwidth link for a flow within the private network's topology. Given these abilities, the management software can set the injection rates for a source-destination flow that traverses the private network. However, this approach is very complicated and, more importantly, it is ineffective at preventing congestion in the local fabric, because the private network may get congested due to traffic from other clients and hosts sharing the private network. If router B is a router tied to the internet, the situation is exacerbated.

Although specific embodiments have been illustrated and described herein for purposes of description of the preferred embodiment, it will be appreciated by those of ordinary skill in the art that a wide variety of alternate and/or equivalent implementations calculated to achieve the same purposes may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. Those with skill in the chemical, mechanical, electromechanical, electrical, and computer arts will readily appreciate that the present invention may be implemented in a very wide variety of embodiments. This application is intended to cover any adaptations or variations of the preferred embodiments discussed herein. Therefore, it is manifestly intended that this invention be limited only by the claims and the equivalents thereof.

1. A distributed computer system comprising: links; and end stations coupled between the links, wherein types of end stations include endnodes which originate or consume frames and routing devices which route frames between the links and do not originate or consume frames, wherein the end stations include a first source endnode which originates frames at a variable injection rate, wherein the first source endnode includes: a congestion control mechanism responding to detected congestion by multiplicatively decreasing the variable injection rate, wherein the variable injection rate (IR) is multiplicatively decreased according to IR(i+1)=IR(i)*1/F1, wherein F1 is a constant, wherein IR(i) is equal to a previous variable injection rate and IR(i+1) is equal to a new variable injection rate.
 2. The distributed computer system of claim 1 wherein the congestion control mechanism responds to detected subsiding of congestion by multiplicatively increasing the variable injection rate.
 3. The distributed computer system of claim 2 wherein the variable injection rate (IR) is multiplicatively increased according to IR(i+1)=IR(i)*F2, wherein F2 is a constant.
 4. The distributed computer system of claim 1 wherein the end stations include a first destination endnode which consumes frames originated from the first source endnode, wherein the first destination endnode includes: a congestion control mechanism detecting congestion on a path the frames route from the first source endnode to the first destination endnode.
 5. The distributed computer system of claim 4 wherein the first destination endnode's congestion control mechanism detects congestion based on Forward Explicit Congestion Notification (FECN) conditions, and forwards the FECN conditions to the first source endnode.
 6. The distributed computer system of claim 1 wherein the end stations include a first destination endnode which consumes frames originated from the first source endnode, wherein the first source endnode's congestion control mechanism detects congestion on a path the frames route from the first source endnode to the first destination endnode by monitoring a previous variable injection rate and a round trip time for a frame to reach the first destination endnode and an acknowledgement (ACK) for the frame from the first destination endnode to reach the first source endnode.
 7. The distributed computer system of claim 1 wherein the first source endnode's congestion control mechanism detects congestion on a path the frames route from the first source endnode by monitoring acknowledgement (ACK) timeouts.
 8. The distributed computer system of claim 1 wherein at least one routing device includes: a congestion control mechanism detecting congestion on a path the frames route through the at least one routing device.
 9. The distributed computer system of claim 8 wherein the at least one routing device includes receive and send port resources, and wherein the at least one routing device's congestion control mechanism detects congestion by analyzing the receive and send port resources.
 10. The distributed computer system of claim 1 wherein at least one routing device includes: a congestion control mechanism responding to detected congestion by dropping frames that are marked droppable for a time period.
 11. The distributed computer system of claim 1 wherein at least one routing device includes: a congestion control mechanism responding to detected congestion by applying link back pressure by reducing a number of credits available for routing frames through the routing device from a link.
 12. A method of controlling congestion in a distributed computer system having links and end stations coupled between the links, wherein types of end stations include endnodes which originate or consume frames and routing devices which route frames between the links and do not originate or consume frames, the method comprising: originating, from a first source endnode, frames at a variable injection rate; detecting congestion; and multiplicatively decreasing the variable injection rate in response to the detected congestion including multiplicatively decreasing the variable injection rate (IR) according to IR(i+1)=IR(i)*1/F1, wherein F1 is a constant, wherein IR(i) is equal to a previous variable injection rate and IR(i+1) is equal to a new variable injection rate.
 13. The method of claim 12 further comprising detecting subsiding of congestion; and multiplicatively increasing the variable injection rate in response to the detected subsiding of congestion.
 14. The method of claim 13 wherein multiplicatively increasing the variable injection rate includes multiplicatively increasing the variable injection rate (IR) according to IR(i+1)=IR(i)*F2, wherein F2 is a constant.
 15. The method of claim 12 further comprising: consuming, at a first destination endnode, frames originated from the first source endnode; and detecting congestion on a path the frames route from the first source endnode to the first destination endnode.
 16. The method of claim 15 wherein the detecting congestion on the path the frames route from the first source endnode to the first destination endnode includes detecting congestion based on Forward Explicit Congestion Notification (FECN) conditions, and the method further comprises: forwarding the FECN conditions to the first source endnode.
 17. The method of claim 12 further comprising: consuming, at a first destination endnode, frames originated from the first source endnode; and detecting congestion on a path the frames route from the first source endnode to the first destination endnode by monitoring a previous variable injection rate and a round trip time for a frame to reach the first destination endnode and an acknowledgement (ACK) for the frame from the first destination endnode to reach the first source endnode.
 18. The method of claim 12 wherein the detecting includes detecting congestion on a path the frames route from the first source endnode by monitoring acknowledgement (ACK) timeouts.
 19. The method of claim 12 further comprising: detecting congestion on a path the frames route through the at least one routing device.
 20. The method of claim 19 wherein the at least one routing device includes receive and send port resources, and the detecting congestion on a path the frames route through the at least one routing device includes analyzing the receive and send port resources.
 21. The method of claim 12 further comprising: dropping frames that are marked droppable for a time period in response to the detected congestion.
 22. The method of claim 12 further comprising: applying link back pressure by reducing a number of credits available for routing frames through the routing device from a link in response to the detected congestion.
 23. A distributed computer system comprising: links; and end stations coupled between the links, wherein types of end stations include endnodes which originate or consume frames and routing devices which route frames between the links and do not originate or consume frames, wherein the end stations include a first source endnode which originates frames at a variable injection rate, wherein at least one routing device includes a congestion control mechanism responding to detected congestion by dropping frames that are marked droppable for a time period, and wherein the first source endnode includes: a congestion control mechanism responding to detected congestion by multiplicatively decreasing the variable injection rate and responding to detected subsiding of congestion by multiplicatively increasing the variable injection rate, wherein the variable injection rate (IR) is multiplicatively decreased according to IR(i+1)=IR(i)*1/F1, wherein F1 is a constant, wherein the variable injection rate (IR) is multiplicatively increased according to IR(i+1)=IR(i)*F2, wherein F2 is a constant, wherein IR(i) is equal to a previous variable injection rate and IR(i+1) is equal to a new variable injection rate.
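
For illustration only, the two update rules recited above, IR(i+1)=IR(i)*1/F1 and IR(i+1)=IR(i)*F2, can be traced numerically. The function names, the values F1 = 2 and F2 = 1.25, and the requirement that both constants exceed 1 are assumptions chosen for this sketch; the claims state only that F1 and F2 are constants.

```python
def next_injection_rate(ir, congested, f1=2.0, f2=1.25):
    """Apply one update step to the variable injection rate IR.

    congested=True  -> IR(i+1) = IR(i) * 1/F1  (multiplicative decrease)
    congested=False -> IR(i+1) = IR(i) * F2    (multiplicative increase)
    """
    # Assumption for this sketch: constants above 1 so that the rate
    # actually falls under congestion and rises when it subsides.
    if f1 <= 1.0 or f2 <= 1.0:
        raise ValueError("this sketch assumes F1 > 1 and F2 > 1")
    if congested:
        return ir * (1.0 / f1)
    return ir * f2


def trace(initial_ir, signals, f1=2.0, f2=1.25):
    """Return the sequence of injection rates produced by a list of congestion signals."""
    rates = []
    ir = initial_ir
    for congested in signals:
        ir = next_injection_rate(ir, congested, f1, f2)
        rates.append(ir)
    return rates


# Two congested intervals followed by two intervals in which congestion subsides.
print(trace(800.0, [True, True, False, False]))
```

Starting from 800 frames per second, the rate falls to 400 and then 200 while congestion is detected, then recovers to 250 and 312.5 as congestion subsides.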